Preface 



The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 
has been held every year since 1997. This year, the eighth in the series (PAKDD 
2004) was held at Carlton Crest Hotel, Sydney, Australia, 26-28 May 2004. 
PAKDD is a leading international conference in the area of data mining. It pro- 
vides an international forum for researchers and industry practitioners to share 
their new ideas, original research results and practical development experiences 
from all KDD-related areas including data mining, data warehousing, machine 
learning, databases, statistics, knowledge acquisition and automatic scientific 
discovery, data visualization, causal induction, and knowledge-based systems. 

The selection process this year was extremely competitive. We received 238 
research papers from 23 countries, which is the highest in the history of PAKDD, 
and reflects the recognition of and interest in this conference. Each submitted 
research paper was reviewed by three members of the program committee. Fol- 
lowing this independent review, there were discussions among the reviewers, and 
when necessary, additional reviews from other experts were requested. A total 
of 50 papers were selected as full papers (21%), and another 31 were selected as 
short papers (13%), yielding a combined acceptance rate of approximately 34%. 

The conference accommodated both research papers presenting original in- 
vestigation results and industrial papers reporting real data mining applications 
and system development experience. The conference also included three tutorials 
on key technologies of knowledge discovery and data mining, and one workshop 
focusing on specific new challenges and emerging issues of knowledge discovery 
and data mining. The PAKDD 2004 program was further enhanced with keynote 
speeches by two outstanding researchers in the area of knowledge discovery and 
data mining: Philip Yu, Manager of Software Tools and Techniques, IBM T.J. 
Watson Research Center, USA and Usama Fayyad, President of DMX Group, 
LLC, USA. 

We would like to thank everyone who participated in the development of 
the PAKDD 2004 program. In particular, we would like to give special thanks to the 
Program Committee. We asked a lot from them and were repeatedly impressed 
with their diligence and deep concern for the quality of the program, and also 
with their detailed feedback to the authors. We also thank Graham Williams for 
chairing the Industry Portal, Chinya V. Ravishankar and Bing Liu for chairing 
the workshop program, and Kai Ming Ting and Sunita Sarawagi for chairing the 
tutorials program. We are very grateful to Gang Li who put a lot of effort into 
the PAKDD 2004 Web site. We also thank Hongjun Lu, Hiroshi Motoda, and 
the other members of the PAKDD Steering Committee for providing valuable 
support and advice on many issues. The task of selecting papers was facilitated 
by Microsoft’s Conference Management Tool, and by the tireless efforts of Mark 
Lau. The general organization of the conference also relied on the efforts of 
many. Simeon J. Simoff did an impressive job of ensuring that the logistics for 
the conference site were executed flawlessly. Li Liu and Guangquan Zhang made 
sure that the registration process flowed smoothly. Sanjay Chawla successfully 
publicized the conference and Peter O’Hanlon obtained sponsorships for the 
conference. We would also like to thank our financial sponsors SAS, UTS, Deakin 
University, and NICTA. 

Finally and most importantly, we thank all the authors, who are the primary 
reason why the PAKDD conference continues to be so exciting, and to be the 
foremost place to learn about advances in both theoretical and applied research 
on KDD. Because of your work, PAKDD 2004 was a great success. 

May 2004 Honghua Dai, Ramakrishnan Srikant 

Chengqi Zhang, Nick Cercone 
Mining of Evolving Data Streams with Privacy Preservation 



Philip S. Yu 

Manager, Software Tools and Techniques 
IBM T.J. Watson Research Center, USA 
psyu@us.ibm.com 



Abstract. The data stream domain has become increasingly important 
in recent years because of its applicability to a wide variety of appli- 
cations. Problems such as data mining and privacy preservation which 
have been studied for traditional data sets cannot be easily solved for the 
data stream domain. This is because the large volume of data arriving 
in a stream renders most algorithms inefficient, as most mining and 
privacy preservation algorithms require multiple scans of the data, which is 
unrealistic for stream data. More importantly, the characteristics of the 
data stream can change over time and the evolving pattern needs to be 
captured. In this talk, I’ll discuss the issues and focus on how to mine 
evolving data streams and preserve privacy. 
Data Mining Grand Challenges 



Usama Fayyad 

President, DMX Group, LLC, USA 
Abstract. The past two decades have seen a huge wave of computational 
systems for the “digitization” of business operations from ERP, to manu- 
facturing, to systems for customer interactions. These systems increased 
the throughput and efficiency of conducting “transactions” and resulted 
in an unprecedented build-up of data captured from these systems. The 
paradoxical reality that most organizations face today is that they have 
more data about every aspect of their operations and customers, yet they 
find themselves with an ever diminishing understanding of either. Data 
Mining has received much attention as a technology that can possibly 
bridge the gap between data and knowledge. 



While some interesting progress has been achieved over the past few years, es- 
pecially when it comes to techniques and scalable algorithms, very few organiza- 
tions have managed to benefit from the technology. Despite the recent advances, 
some major hurdles exist on the road to the needed evolution. Furthermore, most 
technical research work does not appear to be directed at these challenges, nor 
does it appear to be aware of their nature. This talk will cover these challenges 
and present them in both the technical and the business context. The exposition 
will cover deep technical research questions, practical application considerations, 
and social/economic considerations. The talk will draw on illustrative examples 
from scientific data analysis, commercial applications of data mining in under- 
standing customer interaction data, and considerations of coupling data mining 
technology with database management systems. Of particular interest is the 
business challenge of how to make the technology really work in practice. There 
are many unsolved deep technical research problems in this field and we conclude 
by covering a sampling of these. 
Evaluating the Replicability of Significance Tests for 
Comparing Learning Algorithms 



Remco R. Bouckaert¹,² and Eibe Frank² 

¹ 215 Three Oaks Drive, Dairy Flat, Auckland, New Zealand 
rrb@xm.co.nz 

² Computer Science Department, University of Waikato 

Abstract. Empirical research in learning algorithms for classification tasks generally 
requires the use of significance tests. The quality of a test is typically judged 
on Type I error (how often the test indicates a difference when it should not) and 
Type II error (how often it indicates no difference when it should). In this paper 
we argue that the replicability of a test is also of importance. We say that a test has 
low replicability if its outcome strongly depends on the particular random parti- 
tioning of the data that is used to perform it. We present empirical measures of 
replicability and use them to compare the performance of several popular tests in 
a realistic setting involving standard learning algorithms and benchmark datasets. 
Based on our results we give recommendations on which test to use. 



1 Introduction 



Significance tests are often applied to compare performance estimates obtained by re- 
sampling methods — such as cross-validation [1] — that randomly partition data. In this 
paper we consider the problem that a test may be very sensitive to the particular random 
partitioning used in this process. If this is the case, it is possible that, using the same data, 
the same learning algorithms A and B, and the same significance test, one researcher 
finds that method A is preferable, while another finds that there is not enough evidence 
for this. Lack of replicability can also cause problems when “tuning” an algorithm: a test 
may judge favorably on the latest modification purely due to its sensitivity to the par- 
ticular random number seed used to partition the data. In this paper we extend previous 
work on replicability [2,3] by studying the replicability of some popular tests in a more 
realistic setting based on standard benchmark datasets taken from the UCI repository of 
machine learning problems [4] . 

The structure of the paper is as follows. In Section 2 we review how significance 
tests are used for comparing learning algorithms and introduce the notion of replicability. 
Section 3 discusses some popular tests in detail. Section 4 contains empirical results for 
these tests and highlights the lack of replicability of some of them. Section 5 summarizes 
the results and makes some recommendations based on our empirical findings. 
2 Evaluating Significance Tests 

We consider a scenario where we have a certain application domain and we are interested 
in the mean difference in accuracy between two classification algorithms in this domain, 
given that the two algorithms are trained on a dataset with N instances. We do not 
know the joint distribution underlying the domain and consequently cannot compute the 
difference exactly. Hence we need to estimate it, and, to check whether the estimated 
difference is likely to be a “true” difference, perform a significance test. To this end we 
also need to estimate the variance of the differences across different training sets. 

Obtaining an unbiased estimate of the mean and variance of the difference is easy if 
there is a sufficient supply of data. In that case we can sample a number of training sets of 
size N, run the two learning algorithms on each of them, and estimate the difference in 
accuracy for each pair of classifiers on a large test set. The average of these differences is 
an estimate of the expected difference in generalization error across all possible training 
sets of size N, and their variance is an estimate of the variance. Then we can perform 
a paired t-test to check the null hypothesis that the mean difference is zero. The Type I 
error of a test is the probability that it rejects the null hypothesis incorrectly (i.e. it finds 
a “significant” difference although there is none). Type II error is the probability that the 
null hypothesis is not rejected when there actually is a difference. The test’s Type I error 
will be close to the chosen significance level $\alpha$. 

In practice we often only have one dataset of size N and all estimates must be 
obtained from this one dataset. Different training sets are obtained by subsampling, and 
the instances not sampled for training are used for testing. For each training set $S_i$, 
$1 \le i \le k$, we get a matching pair of accuracy estimates and the difference $x_i$. The 
mean and variance of the differences $x_i$ are used to estimate the mean and variance of 
the difference in generalization error across different training sets. Unfortunately this 
violates the independence assumption necessary for proper significance testing because 
we re-use the data to obtain the different $x_i$’s. The consequence of this is that the Type 
I error exceeds the significance level. This is problematic because it is important for the 
researcher to be able to control the Type I error and know the probability of incorrectly 
rejecting the null hypothesis. Several heuristic versions of the t-test have been developed 
to alleviate this problem [5,6]. 

In this paper we study the replicability of significance tests. Consider a test based 
on the accuracy estimates generated by cross-validation. Before the cross-validation 
is performed, the data is randomized so that each of the resulting training and test sets 
exhibits the same distribution. Ideally, we would like the test’s outcome to be independent 
of the particular partitioning resulting from the randomization process because this would 
make it much easier to replicate experimental results published in the literature. However, 
in practice there is always a certain sensitivity to the partitioning used. To measure 
replicability we need to repeat the same test several times on the same data with different 
random partitionings — in this paper we use ten repetitions — and count how often the 
outcome is the same. Note that a test will have greater replicability than another test 
with the same Type I and Type II error if it is more consistent in its outcomes for each 
individual dataset. 

We use two measures of replicability. The first measure, which we call consistency, 
is based on the raw counts. If the outcome is the same for every repetition of a test 
on the same data, we call the test consistent, and if there is a difference at most once, 
we call it almost consistent. This procedure is repeated with multiple datasets, and the 
fraction of outcomes for which a test is consistent or almost consistent is an indication 
of how replicable the test is. The second measure, which we call replicability, is based 
on the probability that two runs of the test on the same data set will produce the same 
outcome. This probability is never worse than 0.5. To estimate it we need to consider pairs 
of randomizations. If we have performed the test based on n different randomizations 
for a particular dataset then there are $\binom{n}{2}$ such pairs. Assume the test rejects the null 
hypothesis for k ($0 \le k \le n$) of the randomizations. Then there are $\binom{k}{2}$ rejecting 
pairs and $\binom{n-k}{2}$ accepting ones. Based on this the above probability can be estimated as 

$$ R(k, n) = \frac{\binom{k}{2} + \binom{n-k}{2}}{\binom{n}{2}} = \frac{k(k-1) + (n-k)(n-k-1)}{n(n-1)}. $$

We use this probability to form a measure of replicability across different datasets. 
Assume there are m datasets and let $i_k$ ($0 \le k \le n$) be the number of datasets for which 
the test agrees k times (i.e. $\sum_{k=0}^{n} i_k = m$). Then we define replicability as 

$$ R = \sum_{k=0}^{n} \frac{i_k}{m} R(k, n). $$

The larger the value of this measure, the more likely the test is to produce the same 
outcome for two different randomizations of a dataset. 
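
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and the counts are the NB vs C4.5 column of Table 1. That column lists non-rejections rather than rejections, which makes no difference because R(k, n) is symmetric in k and n - k.

```python
from math import comb


def pair_agreement(k: int, n: int) -> float:
    """R(k, n): fraction of pairs of randomizations with the same outcome."""
    return (comb(k, 2) + comb(n - k, 2)) / comb(n, 2)


def replicability(counts, n):
    """Average R(k, n) over datasets; counts holds one k per dataset.

    Averaging per dataset is the same as summing (i_k / m) * R(k, n) over k.
    """
    return sum(pair_agreement(k, n) for k in counts) / len(counts)


# NB vs C4.5 column of Table 1 (27 datasets, n = 10 repetitions each).
nb_vs_c45 = [4, 9, 5, 10, 1, 10, 6, 7, 9, 6, 4, 9, 8, 10,
             10, 10, 8, 9, 10, 7, 10, 8, 0, 4, 4, 8, 10]

print(round(replicability(nb_vs_c45, 10), 3))  # 0.737, as in Table 1
```

For a single dataset with n = 10, the estimate is 1 when all ten outcomes agree, 0.8 when one outcome differs, and about 0.44 for an even five-to-five split.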



3 Significance Tests 

In this section we review some tests for comparing learning algorithms. Although testing 
is essential for empirical research, surprisingly little has been written on this topic. 

3.1 The 5x2cv Paired t-Test 

Dietterich [5] evaluates several significance tests by measuring their Type I and Type II 
error on artificial and real-world data. He finds that the paired t-test applied to random 
subsampling has an exceedingly large Type I error. In random subsampling a training 
set is drawn at random without replacement and the remainder of the data is used for 
testing. This is repeated a given number of times. In contrast to cross-validation, random 
subsampling does not ensure that the test sets do not overlap. Ten-fold cross-validation 
can be viewed as a special case of random subsampling repeated ten times, where 90% 
of the data is used for training, and it is guaranteed that the ten test sets do not overlap. 
The paired t-test based on ten-fold cross-validation fares better in the experiments in [5] 
but also exhibits an inflated Type I error. On one of the real-world datasets its Type I 
error is approximately twice the significance level. 
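
The difference between the two partitioning schemes can be made concrete with a small sketch. The helper names and the use of Python's `random` module are our own choices for illustration, not the setup used in [5]; the point is only that k-fold test sets partition the data, while independently drawn subsamples carry no such guarantee.

```python
import random


def random_subsampling(n_instances, n_runs, train_fraction, seed):
    """Draw n_runs independent train/test splits; test sets may overlap across runs."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_instances))
    splits = []
    for _ in range(n_runs):
        idx = list(range(n_instances))
        rng.shuffle(idx)  # sampling without replacement within one run
        splits.append((idx[:n_train], idx[n_train:]))
    return splits


def k_fold(n_instances, k, seed):
    """One run of k-fold cross-validation; the k test sets are disjoint."""
    rng = random.Random(seed)
    idx = list(range(n_instances))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [x for f in folds[:i] + folds[i + 1:] for x in f]
        splits.append((train, folds[i]))
    return splits


cv_tests = [set(test) for _, test in k_fold(100, 10, seed=0)]
print(sum(len(t) for t in cv_tests), len(set().union(*cv_tests)))  # 100 100
```

With ten folds each training set holds 90% of the data, which matches the special case described above.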

As an alternative [5] proposes a heuristic test based on five runs of two-fold 
cross-validation, called the “5x2cv paired t-test”. In an r-times k-fold cross-validation there are 
r, $r \ge 1$, runs and k, $k > 1$, folds. For each run j, $1 \le j \le r$, the data is randomly 
permuted and split into k subsets of equal size.¹ We call these subsets i, $1 \le i \le k$, the k 
folds of run j. We consider two learning schemes A and B and measure their respective 
accuracies $a_{ij}$ and $b_{ij}$ for fold i and run j. To obtain $a_{ij}$ and $b_{ij}$ the corresponding 
learning scheme is trained on all the data excluding that in fold i of run j and tested on 
the remainder, i.e. on fold i itself. Note that exactly the same pair of training and test sets 
is used to obtain both $a_{ij}$ and $b_{ij}$. That means a paired significance test is appropriate and 
we can consider the individual differences in accuracy $x_{ij} = a_{ij} - b_{ij}$ as the input for the test. 

¹ Of course, in some cases it may not be possible to split the data into subsets that have exactly 
the same size. 

Let $\bar{x}_{\cdot j}$ denote the mean difference for a single run of 2-fold cross-validation, 
$\bar{x}_{\cdot j} = (x_{1j} + x_{2j})/2$. The variance is 
$\hat{\sigma}_j^2 = (x_{1j} - \bar{x}_{\cdot j})^2 + (x_{2j} - \bar{x}_{\cdot j})^2$. The 5x2cv paired 
t-test uses the following test statistic: 

$$ t = \frac{x_{11}}{\sqrt{\frac{1}{5} \sum_{j=1}^{5} \hat{\sigma}_j^2}} $$

This statistic is plugged into the Student-t distribution with five degrees of freedom. Note 
that the numerator only uses the term $x_{11}$ and not the other differences $x_{ij}$. Consequently 
the outcome of the test is strongly dependent on the particular partitioning of the data 
used when the test is performed. Therefore it can be expected that the replicability of 
this test is not high. Our empirical evaluation demonstrates that this is indeed the case. 
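
A minimal sketch of this statistic follows, written from the definitions above. The function name and the input layout (one pair of fold differences per run) are our own assumptions, and the numbers in the usage example are made up purely for illustration.

```python
from math import sqrt


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic.

    diffs[j] = (x_1j, x_2j): accuracy differences on the two folds of run j.
    Raises ZeroDivisionError if all five per-run variances are zero.
    """
    if len(diffs) != 5 or any(len(d) != 2 for d in diffs):
        raise ValueError("expected five runs with two fold differences each")
    variances = []
    for x1, x2 in diffs:
        run_mean = (x1 + x2) / 2
        variances.append((x1 - run_mean) ** 2 + (x2 - run_mean) ** 2)
    # Only x_11, the first difference of the first run, enters the numerator.
    return diffs[0][0] / sqrt(sum(variances) / 5)


# Hypothetical differences; swapping the two folds of run 1 changes the result
# even though the set of observed differences is identical.
runs = [(0.01, 0.03)] * 5
swapped = [(0.03, 0.01)] + runs[1:]
print(five_by_two_cv_t(runs), five_by_two_cv_t(swapped))
```

The resulting value would be compared against the Student-t distribution with five degrees of freedom; the two-sided 5% critical value from standard tables is about 2.571. The swap in the usage example illustrates the dependence on the partitioning that the text points out.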

The empirical results in [5] show that the 5x2cv paired t-test has a Type I error at 
or below the significance level. However, they also show that it has a much higher Type 
II error than the standard t-test applied to ten-fold cross-validation. Consequently the 
former test is recommended in [5] when a low Type I error is essential, and the latter 
test otherwise. 

The other two tests evaluated in [5] are McNemar’s test and the test for the differ- 
ence of two proportions. Both of these tests are based on a single train/test split and 
consequently cannot take variance due to the choice of training and test set into account. 
Of these two tests, McNemar’s test performs better overall: it has an acceptable Type I 
error and the Type II error is only slightly lower than that of the 5x2cv paired t-test. 
However, because these two tests are inferior to the 5x2cv test, we will not consider 
them in our experiments. 

3.2 Tests Based on Random Subsampling 

As mentioned above, Dietterich [5] found that the standard t-test has a high Type I error 
when used in conjunction with random subsampling. Nadeau and Bengio [6] observe that 
this is due to an underestimation of the variance because the samples are not independent 
(i.e. the different training and test sets overlap). Consequently they propose to correct 
the variance estimate by taking this dependency into account. 

Let $a_j$ and $b_j$ be the accuracy of algorithms A and B respectively, measured on 
run j ($1 \le j \le n$). Assume that in each run $n_1$ instances are used for training, and 
the remaining $n_2$ instances for testing. Let $x_j$ be the difference $x_j = a_j - b_j$, and $\hat{\mu}$ 
and $\hat{\sigma}^2$ the estimates of the mean and variance of the n differences. The statistic of the 
“corrected resampled t-test” is: 

$$ t = \frac{\hat{\mu}}{\sqrt{\left(\frac{1}{n} + \frac{n_2}{n_1}\right) \hat{\sigma}^2}} $$

This statistic is used in conjunction with the Student-t distribution and n − 1 degrees of 
freedom. The only difference to the standard t-test is that the factor $\frac{1}{n}$ in the denominator 
has been replaced by the factor $\frac{1}{n} + \frac{n_2}{n_1}$. Nadeau and Bengio [6] suggest that normal 
usage would call for $n_1$ to be 5 or 10 times larger than $n_2$. 

Empirical results show that this test dramatically improves on the standard resampled 
t-test: the Type I error is close to the significance level, and, unlike McNemar’s test and 
the 5x2cv test, it does not suffer from high Type II error [6]. 
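
The effect of the correction can be sketched as follows. The function name and its `corrected` flag are our own, the variance is assumed to be the usual sample variance with an n − 1 denominator (the text above does not spell this out), and the differences in the example are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def resampled_t(diffs, n_train, n_test, corrected=True):
    """t statistic over n resampling runs, with or without the variance correction.

    diffs: per-run accuracy differences x_j = a_j - b_j.
    Assumption: sigma^2 is the sample variance (n - 1 denominator).
    """
    n = len(diffs)
    factor = 1 / n
    if corrected:
        factor += n_test / n_train  # accounts for overlapping training sets
    return mean(diffs) / sqrt(factor * variance(diffs))


# Hypothetical differences from 10 runs with a 90%/10% split of 1000 instances.
x = [0.02, 0.05, 0.01, 0.04, 0.03, 0.06, 0.00, 0.03, 0.02, 0.04]
print(resampled_t(x, 900, 100, corrected=False))
print(resampled_t(x, 900, 100, corrected=True))
```

With ten runs and a 90/10 split the denominator factor grows from 1/10 to 1/10 + 1/9, so the corrected statistic is roughly 0.69 times the standard one and is correspondingly less likely to reject the null hypothesis.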

3.3 Tests Based on Repeated k-Fold Cross Validation 

Here we consider tests based on r-times k-fold cross-validation where r and k can have 
any value. As in Section 3.1, we observe differences $x_{ij} = a_{ij} - b_{ij}$ for fold i and 
run j. One could simply use $m = \frac{1}{k \cdot r} \sum_{i=1}^{k} \sum_{j=1}^{r} x_{ij}$ as an estimate for the mean and 
$\hat{\sigma}^2 = \frac{1}{k \cdot r - 1} \sum_{i=1}^{k} \sum_{j=1}^{r} (x_{ij} - m)^2$ as an estimate for the variance. Then, assuming the 
various values of $x_{ij}$ are independent, the test statistic $t = m / \sqrt{\frac{1}{k \cdot r} \hat{\sigma}^2}$ is distributed 
according to a t-distribution with $df = k \cdot r - 1$ degrees of freedom. Unfortunately, the 
independence assumption is highly flawed, and tests based on this assumption show very 
high Type I error, similar to plain subsampling. 

However, the same variance correction as in the previous subsection can be performed 
here because cross-validation is a special case of random subsampling where we ensure 
that the test sets in one run do not overlap. (Of course, test sets from different runs will 
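overlap.) This results in the following statistic: 

$$ t = \frac{\frac{1}{k \cdot r} \sum_{i=1}^{k} \sum_{j=1}^{r} x_{ij}}{\sqrt{\left(\frac{1}{k \cdot r} + \frac{n_2}{n_1}\right) \hat{\sigma}^2}} $$

where $n_1$ is the number of instances used for training, and $n_2$ the number of instances 
used for testing. We call this test the “corrected repeated k-fold cv test”. 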
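
As a worked illustration of the size of the correction, consider 10-times 10-fold cross-validation and assume the folds have exactly equal size, so that each test fold holds 1/k of the data and $n_2 / n_1 = 1/(k-1)$. The symbols $t_{\mathrm{corrected}}$ and $t_{\mathrm{standard}}$ are introduced here only for this comparison.

```latex
% Assumption: equal-sized folds, so n_2 / n_1 = (N/k) / (N (k-1)/k) = 1/(k-1).
\begin{align*}
  k \cdot r &= 10 \cdot 10 = 100, \qquad \frac{n_2}{n_1} = \frac{1}{k-1} = \frac{1}{9}, \\
  \frac{1}{k \cdot r} + \frac{n_2}{n_1} &= 0.01 + 0.1111\ldots \approx 0.1211, \\
  \frac{t_{\mathrm{corrected}}}{t_{\mathrm{standard}}}
    &= \sqrt{\frac{1/(k \cdot r)}{1/(k \cdot r) + n_2/n_1}}
     \approx \sqrt{\frac{0.01}{0.1211}} \approx 0.287 .
\end{align*}
```

Under these assumptions the corrected statistic is less than a third of the uncorrected one for the same differences, which is why the correction has such a strong effect on the Type I error.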

4 Empirical Evaluation 

To evaluate how replicability affects the various tests, we performed experiments on a 
selection of datasets from the UCI repository [4]. We used naive Bayes, C4.5 [7], and 
the nearest neighbor classifier, with default settings as implemented in Weka² version 
3.3 [1]. For tests that involve multiple folds, the folds were chosen using stratification, 
which ensures that the class distribution in the whole dataset is reflected in each of the 
folds. Each of the tests was run ten times for each pair of learning schemes and a 5% 
significance level was used in all tests unless stated otherwise. 
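
To make the idea of stratification concrete, here is one simple way to build stratified folds. It is a sketch under our own assumptions (the function name, round-robin assignment, and labels that can be sorted) and is not claimed to reproduce the procedure implemented in Weka.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed):
    """Assign instance indices to k folds so each fold mirrors the class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):       # fixed class order for reproducibility
        members = by_class[label]
        rng.shuffle(members)             # random order within each class
        for idx in members:
            folds[position % k].append(idx)  # deal instances out round-robin
            position += 1
    return folds


labels = ["yes"] * 60 + ["no"] * 40
folds = stratified_folds(labels, k=10, seed=0)
print([sum(labels[i] == "yes" for i in fold) for fold in folds])
```

In this toy case every fold ends up with six "yes" and four "no" instances, matching the 60/40 split of the whole dataset; changing the seed changes which instances land in which fold but not the per-fold class counts.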

4.1 Results for the 5x2cv Paired t-Test 

Table 1 shows the datasets and their properties, and the results for the 5x2 cross validation 
test. The three right-most columns show the number of times the test does not reject the 
null hypothesis, i.e., the number of times the 5x2 cross validation test indicates that there 

² Weka is freely available with source from http://www.cs.waikato.ac.nz/ml. 
Table 1. The number of cases (#inst.), attributes (#atts.), and classes (#cl.) for each dataset; and 
the number of draws for each pair of classifiers based on the 5x2 cross validation test (NB = naive 
Bayes, NN = nearest neighbor). 

dataset                   #inst.  #atts.  #cl.  NB vs C4.5  NB vs NN  C4.5 vs NN
anneal                       898      38     5           4         4          10
arrhythmia                   452     280    13           9         9           2
audiology                    226      69    24           5        10           8
autos                        205      25     6          10         7          10
balance-scale                625       4     3           1         4           7
breast-cancer                286       9     2          10         9           8
credit-rating                690      16     2           6         8          10
ecoli                        336       8     8           7        10          10
German credit               1000      20     2           9         6          10
glass                        214       9     6           6         6           9
heart-statlog                270      13     2           4         5           9
hepatitis                    155      19     2           9        10          10
horse-colic                  368      22     2           8        10           7
Hungarian heart disease      294      13     2          10        10          10
ionosphere                   351      34     2          10        10           8
iris                         150       4     3          10        10          10
labor                         57      16     2           8        10          10
lymphography                 148      18     4           9        10          10
pima-diabetes                768       8     2          10         6           7
primary-tumor                339      17    21           7         3          10
sonar                        208      60     2          10         9           6
soybean                      683      35    19           8         8           9
vehicle                      846      18     4           0         0           9
vote                         435      16     2           4         9           7
vowel                        990      13    11           4         0           0
Wisconsin breast cancer      699       9     2           8         9          10
zoo                          101      16     7          10        10           8

Consistent:                                              9        12          13
Almost consistent:                                      14        17          17
Replicability (R):                                   0.737     0.783       0.816


is no difference between the corresponding pair of classifiers. For example, for the anneal 
dataset, the test indicates no difference between naive Bayes and C4.5 four times, so six 
times it does indicate a difference. Note that the same dataset, the same algorithm, the 
same settings, and the same significance test were used in each of the ten experiments. 
The only difference was in the way the dataset was split into the 2 folds in each of the 5 
runs. Clearly, the test is very sensitive to the particular partitioning of the anneal data. 

Looking at the column for naive Bayes vs. C4.5, this test could be used to justify 
the claim that the two perform the same for all datasets except the vehicle dataset just 
by choosing appropriate random number seeds. However, it could just as well be used 
to support the claim that the two algorithms perform differently in 19 out of 27 cases. 

For some rows, the test consistently indicates no difference between any two of the 
three schemes, in particular for the iris and Hungarian heart disease datasets. However, 
most rows contain at least one cell where the outcomes of the test are not consistent. 

The row labeled “consistent” at the bottom of the table lists the number of datasets 
for which all outcomes of the test are the same. These are calculated as the number of 0’s 
and 10’s in the column. For any of the compared schemes, less than 50% of the results 
turn out to be consistent. 

Note that it is possible that, when comparing algorithms A and B, sometimes A is 
preferred and sometimes B if the null hypothesis is rejected. However, closer inspection 
of the data reveals that this only happens when the null hypothesis is accepted most of 
the time, except for 2 or 3 runs. Consequently these cases do not contribute to the value 
of the consistency measure. 

If we could accept that one outcome of the ten runs does not agree with the rest, we 
get the number labeled “almost consistent” in Table 1 (i.e. the number of 0’s, 1’s, 9’s 
and 10’s in a column). The 5x2 cross validation test is almost consistent in fewer than 
66% of the cases, which is still a very low rate. 

The last row shows the value of the replicability measure R for the three pairs of 
learning schemes considered. These results reflect the same behaviour as the consistency 
measures. The replicability values are pretty low considering that R cannot be smaller 
than 0.5. 
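
The two consistency rows of Table 1 can be recomputed directly from its three right-most columns. The sketch below uses our own function and variable names; the counts are copied from the table.

```python
def consistency_counts(counts, n=10):
    """Number of datasets with identical outcomes in all n runs, or in all but one."""
    consistent = sum(1 for k in counts if k in (0, n))
    almost = sum(1 for k in counts if k in (0, 1, n - 1, n))
    return consistent, almost


# Non-rejection counts per dataset, in the row order of Table 1.
columns = {
    "NB vs C4.5": [4, 9, 5, 10, 1, 10, 6, 7, 9, 6, 4, 9, 8, 10,
                   10, 10, 8, 9, 10, 7, 10, 8, 0, 4, 4, 8, 10],
    "NB vs NN": [4, 9, 10, 7, 4, 9, 8, 10, 6, 6, 5, 10, 10, 10,
                 10, 10, 10, 10, 6, 3, 9, 8, 0, 9, 0, 9, 10],
    "C4.5 vs NN": [10, 2, 8, 10, 7, 8, 10, 10, 10, 9, 9, 10, 7, 10,
                   8, 10, 10, 10, 7, 10, 6, 9, 9, 7, 0, 10, 8],
}

for name, counts in columns.items():
    consistent, almost = consistency_counts(counts)
    m = len(counts)
    print(f"{name}: consistent {consistent}/{m} ({100 * consistent / m:.0f}%), "
          f"almost consistent {almost}/{m} ({100 * almost / m:.0f}%)")
```

This gives 9, 12, and 13 consistent datasets out of 27 (33%, 44%, and 48%) and 14, 17, and 17 almost consistent ones (52%, 63%, and 63%), in line with the "less than 50%" and "fewer than 66%" figures quoted above.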



4.2 Results for the Corrected Resampled t-Test 

In the resampling experiments, the data was randomized, 90% of it used for training, 
and the remaining 10% used to measure accuracy. This was repeated with a different 
random number seed for each run. Table 2 shows the results for the corrected resampled 
t-test. The number of runs used in resampling was varied from 10 to 100 to see the effect 
on the replicability. 

The replicability increases with the number of runs almost everywhere. The only 
exception is in the last row, where the “almost consistent” value decreases by one when 
increasing the runs from 10 to 20. This can be explained by random fluctuations due to 
the random partitioning of the datasets. Overall, the replicability becomes reasonably 
acceptable when the number of runs is 100. In this case 80% of the results are “almost 
consistent”, and the value of the replicability measure R is approximately 0.9 or above. 

4.3 Results for Tests Based on (Repeated) Cross Validation 

For the standard t-test based on a single run of 10-fold cross validation we observed 
consistent results for 15, 16, and 14 datasets, comparing NB with C4.5, NB with NN, 
and C4.5 with NN respectively. Contrasting this with corrected resampling with 10 runs, 
which takes the same computational effort, we see that 10-fold cross validation is at least 
as consistent. However, it is substantially less consistent than (corrected) resampling at 
100 runs. Note also that this test has an inflated Type I error [5]. 
Table 2. Replicability for corrected resampled t-test. 




Performing the same experiment in conjunction with the standard t-test based on 
the 100 differences obtained by 10-times 10-fold cross validation, produced consistent 
results for 25, 24, and 18 datasets, based on NB with C4.5, NB with NN, and C4.5 with 
NN respectively. This looks impressive compared to any of the tests we have evaluated 
so far. However, the Type I error of this test is very high (because of the overlapping 
training and test sets) and therefore it should not be used in practice. 

To reduce Type I error it is necessary to correct the variance. Table 3 shows the same 
results for the corrected paired t-test based on the paired outcomes of r-times 10-fold 
cross validation. Comparing this to Table 2 (for corrected resampling) the consistency 
is almost everywhere as good and often better (assuming the same computational effort 
in both cases): the column with 1 run in Table 3 should be compared with the 10 runs 
column in Table 2, the column with 2 runs in Table 3 with the column with 20 runs in 
Table 2, etc. The same can be said about the replicability measure R. This indicates that 
repeated cross validation helps to improve replicability (compared to just performing 
random subsampling). 

To ensure that the improved replicability of cross-validation is not due to stratification 
(which is not performed in the case of random subsampling), we performed an exper- 
iment where resampling was done with stratification. The replicability scores differed 
only very slightly from the ones shown in Table 2, suggesting the improved replicability 
is not due to stratification. 

Because the corrected paired t-test based on 10-times 10-fold cross validation 
exhibits the best replicability scores, we performed an experiment to see how sensitive its 
replicability is to the significance level. The results, shown in Table 4, demonstrate that 
the significance level does not have a major impact on consistency or the replicability 
measure R. Note that the latter is greater than 0.9 in every single case, indicating very 
good replicability for this test. 
Table 3. Replicability for corrected rx10 fold cross-validation test. 

Table 4. Replicability of corrected 10x10 fold cross-validation test for various significance 
levels. 




Table 5. Results for data sources 1 to 4: the difference in accuracy between naive Bayes and C4.5 
(in percent) and the consistency of the tests (in percent). 

Source                        1      2      3      4    min.
Δ accuracy                  0.0   2.77   5.83  11.27
5x2 cv                     72.3   71.2   63.5   16.9    16.9
10 x resampling            65.5   44.0   26.0   48.8    26.0
100 x resampling           90.9   73.2   66.8   97.2    66.8
10-fold cv                 49.7   47.6   33.2   90.8    33.2
corrected 10x10 fold cv    91.9   80.3   76.7   98.9    76.7


4.4 Simulation Experiment 

To study the effect of the observed difference in accuracy on replicability, we performed 
a simulation study. Four data sources were selected by randomly generating Bayesian 
networks over 10 binary variables where the class variable had 0.5 probability of being 
zero or one. A 0.5 probability of the class variable is known to cause the largest vari- 
ability due to selection of the training data [5]. The first network had no arrows, and all
variables except the class variable were independently selected with various
probabilities. This guarantees that any learning scheme will have 50% expected
accuracy on the test data. The other three data sources had a BAN structure [8], generated by
starting with a naive Bayes model and adding arrows while guaranteeing acyclicity.

Using stochastic simulation [9], a collection of 1000 training sets with 300 instances 
each was created. Naive Bayes and C4.5 were trained on each of them and their accuracy 
measured on a test set of 20,000 cases, generated from each of the data sources. The 
average difference in accuracy is shown in Table 5 in the row marked Δ accuracy, and
it ranges from 0% to 11.27%.
Each of the tests was run 10 times on each of the 4 x 1000 training sets. Table 5
shows, for each of the tests and each data source, the percentage of training sets for 
which the test is consistent (i.e., indicates the same outcome 10 times). The last column 
shows the minimum of the consistency over the four data sources. 
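The consistency measure described above can be sketched in a few lines. The toy outcomes, the function name and the source names are illustrative assumptions; the experiment in the text uses 1000 training sets and 10 runs per set.

```python
def consistency(outcomes_per_training_set):
    """Percentage of training sets on which every repeated run agrees.

    outcomes_per_training_set: list of lists; each inner list holds the
    outcomes (e.g. True = "significant difference") of the repeated runs
    of one test on one training set.
    """
    consistent = sum(1 for runs in outcomes_per_training_set
                     if len(set(runs)) == 1)
    return 100.0 * consistent / len(outcomes_per_training_set)

# Toy outcomes for two hypothetical data sources, 3 training sets each,
# 4 repeated runs per training set.
source_a = [[True] * 4, [False] * 4, [True, False, True, True]]
source_b = [[True] * 4, [True] * 4, [True] * 4]
per_source = [consistency(source_a), consistency(source_b)]
worst_case = min(per_source)  # corresponds to the "min." column of Table 5
```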

Again, 5x2 cross validation, 10 times resampling, and 10 fold cross validation show 
rather low consistency. Replicability increases dramatically with 100 times resampling, 
and increases even further when performing 10 times repeated 10 fold cross validation. 
This is consistent with the results observed on the UCI datasets. 

Table 5 shows that the tests have fewer problems with data sources 1 and 4 (apart 
from the 5 x 2 cv test), where it is easy to decide whether the two schemes differ. The 
5x2 test has problems with data source 4 because it is a rather conservative test (low 
Type I error, high Type II error) and tends to err on the side of being too cautious when 
deciding whether two schemes differ. 

5 Conclusions 

We considered tests for choosing between two learning algorithms for classification 
tasks. We argued that such a test should not only have an appropriate Type I error and 
low Type II error, but also high replicability. High replicability facilitates reproducing 
published results and reduces the likelihood of oversearching. In our experiments, good 
replicability was obtained using 100 runs of random subsampling in conjunction with 
Nadeau and Bengio’s corrected resampled t-test, and replicability improved even further
by using 10-times 10-fold cross-validation instead of random subsampling. Both
methods are acceptable but for best replicability we recommend the latter one. 
Abstract. Data mining problems often involve a large amount of unlabeled
data, and there is often very limited known information on the
dataset. In such scenarios, semi-supervised learning can often improve
classification performance by utilizing unlabeled data for learning. In
this paper, we propose a novel approach to semi-supervised learning as
an optimization of both the classification energy and the cluster
compactness energy in the unlabeled dataset. The resulting integer programming
problem is relaxed by a semi-definite relaxation for which an efficient solution
can be obtained. Furthermore, spectral graph methods provide improved
energy minimization via the incorporation of additional criteria.
Results on UCI datasets are promising.



1 Introduction 

In this paper, we adopt an energy minimization approach to semi-supervised 
learning where the energy function is composed of two distinct parts: the super- 
vised energy function and the unsupervised energy function. The unsupervised 
energy incorporates measures on cluster compactness. The supervised energy is 
specified as a function of the classification probability of any chosen classifier.

The modeling of the semi-supervised learning process as two separate energy
functions allows considerable flexibility in incorporating existing clustering
models and in selecting a suitable classifier. The estimation of unknown labels
via minimization of the energy function is, however, an integer programming
problem, which often leads to solutions of a combinatorial nature. We propose to
adopt a semi-definite relaxation of the integer program to obtain solutions in
polynomial time. Using results from spectral graph theory, minimum energy
solutions can be attained that satisfy different desired constraints.

This paper is organized as follows. In section 2, we discuss the energy
minimization approach to semi-supervised learning. In section 3, we describe the
relationship with graph-based methods. In section 4, we discuss the spectral
methods for energy minimization. Section 5 gives experimental results on datasets
from the UCI repository.

* The funding from HKBU FRG/01-02/II-67 is acknowledged 
2 Energy Function for Semi-supervised Learning

2.1 Energy Minimization 

Given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with labels $y_i \in \{\pm 1\}$,
where $n$ is the total number of data, $D$ can be partitioned into two sets,
$D = L \cup U$, where $L$ is the set of labeled data with known classification
labels and $U$ is the set of unlabeled data with unknown classification labels.
Classification algorithms often rely primarily on the labeled dataset $L$ for
learning a rule or mapping to classify any unlabeled dataset. On the other hand,
clustering algorithms often rely on compactness measures or probabilistic
assumptions on the unlabeled dataset $U$ to infer the possible groupings and hence
the values of the labels $y_i$. In general, learning from the labeled dataset $L$
and the unlabeled dataset $U$ can be a simultaneous optimization of these two criteria,

$$E(D) = E_{sup}(D) + \lambda E_{unsup}(D) \qquad (1)$$

where $E_{sup}$ measures the agreement of the entire dataset with the training
dataset, $E_{unsup}$ measures the compactness of the entire dataset, and $\lambda$
is a real positive constant that balances the importance of the two criteria. As
the simultaneous optimization of the above criteria often leads to combinatorial
problems which cannot be solved in polynomial time, the development of the energy
function emphasizes arriving at a continuous variation for which solutions can be
obtained in polynomial time.

2.2 Unsupervised Learning 

There are various forms of energy functions for unsupervised learning. The
primary objective is to find a partitioning of the dataset such that the labels are
assigned to minimize the within-class distance or to minimize the between-class
similarity. A simple quadratic energy for minimizing the within-class distance is
$\sum_{i,j:\, y_i = y_j} \|x_i - x_j\|^2$, where the Euclidean distance between
$x_i$ and $x_j$ is calculated only if both have the same label. However, recent
advances in clustering and kernel-based learning motivate the use of similarity
measures in the form of a dot product. Let $K_{ij}$ represent the similarity
between $x_i$ and $x_j$; the between-class similarity can be specified as

$$E_{unsup}(D) = \sum_{i,j:\, y_i \neq y_j} K_{ij} \qquad (2)$$

which is equivalent to the following form where the discrete variables $y \in \{+1, -1\}$
are explicitly specified in a product form

$$E_{unsup}(D) = \sum_{i,j} K_{ij} (y_i - y_j)^2 \qquad (3)$$

While the above equations are equivalent when the variable $y$ takes only the
values $\{+1, -1\}$, Eqn. 3 caters for the case where $y$ is allowed to take
continuous values.
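The relation between the discrete and the continuous form can be checked numerically. The sketch below assumes the quadratic reading $\sum_{i,j} K_{ij}(y_i - y_j)^2$ of the continuous energy (the form that also appears in Eqn. 7) and sums over ordered pairs; under these assumptions the two forms agree on $\pm 1$ labels up to a constant factor of 4. The matrix `K` and the helper names are illustrative.

```python
def between_class_similarity(K, y):
    """Discrete form: sum of K[i][j] over ordered pairs with different labels."""
    n = len(y)
    return sum(K[i][j] for i in range(n) for j in range(n) if y[i] != y[j])

def quadratic_energy(K, y):
    """Quadratic form sum_ij K[i][j] * (y[i] - y[j])**2, defined for real-valued y."""
    n = len(y)
    return sum(K[i][j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n))

K = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
y = [+1, +1, -1]
e_discrete = between_class_similarity(K, y)   # 2 * (0.1 + 0.2)
e_quadratic = quadratic_energy(K, y)          # 4 times e_discrete on +/-1 labels
e_relaxed = quadratic_energy(K, [0.5, 0.5, -1.0])  # also defined for continuous y
```

The constant factor does not change the minimizer, which is why the two forms can be used interchangeably for label assignment.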
2.3 Supervised Learning

We consider a generic model of a classifier using its class output probability
model. Although only a small number of classifiers have a class probability
model, classification scores that reflect some measure of confidence in the
classification can often be obtained from classifiers. Assuming a generic
classifier $C$ which outputs the conditional probability $P(y_i|x_i)$, the
supervised energy $E_{sup}$ can be specified as

$$E_{sup}(D) = \sum_i P(+1|x_i)(y_i - (+1))^2 + \sum_i P(-1|x_i)(y_i - (-1))^2 \qquad (4)$$

Without additional constraints, this criterion will give labels in accordance with
the conditional probability. The second step is to further relax the fixed labels
$\{+1, -1\}$ inside the quadratic term to allow them to take two continuous values,
denoted by $\{y_{z+}, y_{z-}\}$, where

$$E_{sup}(D) = \sum_i P(+1|x_i)(y_i - y_{z+})^2 + \sum_i P(-1|x_i)(y_i - y_{z-})^2 \qquad (5)$$

As $y_i$ is allowed to take continuous values, we need to restrict the optimization
to a feasible solution space. Suitable constraints include the symmetric constraint
$\sum_i y_i = 0$ and the normalization constraint $\sum_i y_i^2 = k$, which sets a
scale on the solution. In particular, the normalization constraint with $k = 1$ is
usually taken.

Construct a symmetric matrix $H$ with positive entries by

$$H_{ij} = \begin{cases}
K_{ij} & \text{if } 1 \le i, j \le n \\
\tfrac{1}{2} P(+1|x_j) & \text{if } i = n+1 \text{ and } 1 \le j \le n \\
\tfrac{1}{2} P(+1|x_i) & \text{if } j = n+1 \text{ and } 1 \le i \le n \\
\tfrac{1}{2} P(-1|x_j) & \text{if } i = n+2 \text{ and } 1 \le j \le n \\
\tfrac{1}{2} P(-1|x_i) & \text{if } j = n+2 \text{ and } 1 \le i \le n \\
0 & \text{if } n+1 \le i, j \le n+2
\end{cases} \qquad (6)$$

and letting $y_{n+1}$ denote the value $y_{z+}$ and $y_{n+2}$ denote the value
$y_{z-}$ enables the total energy function to be represented in the following form

$$E(D) = \sum_{i,j} H_{ij} (y_i - y_j)^2 \qquad (7)$$

subject to the constraints $\sum_i y_i^2 = 1$ and $\sum_i y_i = 0$. This optimization
problem can then be solved via standard spectral methods. Before going into the
details of using different spectral methods for solving the above optimization
problem, the relationship with graph-based methods is first discussed.
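To see how the augmented matrix folds both energies into one quadratic form, the following sketch builds $H$ for a toy problem and compares the result with the sum of the unsupervised and supervised energies. It assumes a factor of one half on the probability entries, $P(-1|x) = 1 - P(+1|x)$, and $\lambda = 1$; the numbers and helper names are made up for illustration.

```python
def build_H(K, p_pos):
    """Augmented (n+2) x (n+2) matrix; p_pos[i] is P(+1 | x_i)."""
    n = len(K)
    H = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        for j in range(n):
            H[i][j] = K[i][j]
        # 0-based index n is the "+1" label node (y_{n+1} in the text),
        # index n + 1 is the "-1" label node (y_{n+2} in the text).
        H[i][n] = H[n][i] = 0.5 * p_pos[i]
        H[i][n + 1] = H[n + 1][i] = 0.5 * (1.0 - p_pos[i])
    return H

def energy(H, y):
    m = len(y)
    return sum(H[i][j] * (y[i] - y[j]) ** 2 for i in range(m) for j in range(m))

K = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.2], [0.1, 0.2, 0.0]]
p_pos = [0.9, 0.6, 0.2]
y = [0.7, 0.4, -0.8]
y_pos, y_neg = 1.0, -1.0

total = energy(build_H(K, p_pos), y + [y_pos, y_neg])
unsup = sum(K[i][j] * (y[i] - y[j]) ** 2 for i in range(3) for j in range(3))
sup = sum(p * (yi - y_pos) ** 2 + (1 - p) * (yi - y_neg) ** 2
          for p, yi in zip(p_pos, y))
```

Each probability entry appears twice in the symmetric matrix, so the two halves add up to the full weight in the supervised term.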
3 Graph-Based Modeling

Constrained minimum cuts incorporating labeled data into a similarity graph have
been proposed for semi-supervised learning (Li, 2001). The minimization of the
constrained cut is approximated via a variation of the randomized contraction
algorithm of (Karger & Stein, 1996), and the class labels are estimated by
computing the class probability through averaging over an ensemble of approximate
minimum cut partitions. Another type of similarity graph has been proposed by
(Blum & Chawla, 2001), who provide several theoretical justifications in terms
of the LOOCV errors of the nearest neighbor classifier and a generative model from an
underlying distribution.

We model the data on an undirected weighted graph $G = (V, E)$, where $V$ is
the set of vertices and $E$ is the set of edges. The weight on an edge represents the
similarity between the data. The similarity matrix $K_{ij}$ represents the similarity
between the vectors $x_i$ and $x_j$. The exponential similarity

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right) \qquad (8)$$



is adopted in this work. 

3.1 Constrained Minimum Cut for Semi-supervised Learning 

Given labeled data of two classes stored in the sets $a$, $b$, where $i \in a$ if
$y_i = +1$ and $i \in b$ if $y_i = -1$, and $a \cap b = \emptyset$, the constrained
minimum cut is a constrained combinatorial minimization for finding a disjoint
partition of $V$ into $A$ and $B$ such that

$$A, B = \arg\min \mathrm{cut}(A, B) \quad \text{where } a \subset A \text{ and } b \subset B. \qquad (9)$$

This constrained minimum cut partitions the unlabeled data into sets with smallest
similarity while each of the partitions contains only vertices having identical
training labels.

The above optimization involves a hard constraint, where the labeled data act
as constraints for the minimum cut optimization. If the hard constraint is replaced
by a fixed cost function, the criterion function for semi-supervised learning is

$$A, B = \arg\min \mathrm{cut}(A, B) + p\, g(A, B, a, b) \qquad (10)$$

where $p$ is a positive integer balancing the compactness of the segmented graph
and the agreement with the labeled samples, and $g(A, B, a, b)$ is a function
measuring the disagreement of the labeled samples with the estimated classification.
Setting the balancing constant $p$ to approach infinity would be equivalent to the
constrained minimum cut formulation in Eqn. 9. The number of labeled samples
that are assigned to the wrong class can be used to construct $g$,

$$g(A, B, a, b) = N(a \cap B) + N(b \cap A) \qquad (11)$$
where $N(\cdot)$ is the number of elements in the set. Thus the function $g$ measures
the number of labeled data that are assigned to the other class. The optimization
problem in Eqn. 10 can be simplified by integrating the cost function $g$ into
the original graph $G$. The augmented graph $G'$ can be constructed by adding
new edges and new nodes to the original graph $G$. We construct two extra
nodes $l_{-1}$, $l_{+1}$, each of which represents one of the two class labels. The
cost function of the labeled samples can then be embedded in the augmented graph by
adding edges connecting the above nodes with the labeled samples,



$$w_{i, l_{+1}} = p \ \text{ if } i \in a, \qquad w_{i, l_{-1}} = p \ \text{ if } i \in b \qquad (12)$$



It is straightforward to observe that the minimum cut on the augmented graph is 
equivalent to minimizing the criterion function in Eqn. 10 under the assumption 
that the label nodes are not merged into a single class. 
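The equivalence can be checked by brute force on a small graph. The sketch below uses a hypothetical 4-node similarity graph with one labeled sample per class, builds the augmented graph with two label nodes, and compares the augmented cut with the penalized criterion for every partition, keeping the label nodes on opposite sides as the text assumes. All weights and names are illustrative.

```python
from itertools import product

def cut(W, side):
    """Total weight of edges whose endpoints lie on different sides."""
    n = len(W)
    return sum(W[i][j] for i in range(n) for j in range(i + 1, n)
               if side[i] != side[j])

# Hypothetical similarity graph; node 0 is labeled +1, node 3 is labeled -1.
W = [[0, 3, 1, 0],
     [3, 0, 1, 1],
     [1, 1, 0, 4],
     [0, 1, 4, 0]]
a, b, p = {0}, {3}, 10

# Augmented graph: index 4 is the "+1" label node, index 5 the "-1" label node.
n = len(W)
W_aug = [row + [0, 0] for row in W] + [[0] * (n + 2), [0] * (n + 2)]
for i in a:
    W_aug[i][n] = W_aug[n][i] = p
for i in b:
    W_aug[i][n + 1] = W_aug[n + 1][i] = p

def penalized(side):
    """cut(A, B) + p * g(A, B, a, b) for a partition of the original nodes."""
    g = sum(1 for i in a if side[i] == 'B') + sum(1 for i in b if side[i] == 'A')
    return cut(W, side) + p * g

sides = [list(s) for s in product('AB', repeat=n)]
identity_holds = all(cut(W_aug, s + ['A', 'B']) == penalized(s) for s in sides)
best = min(sides, key=penalized)
```

Exhaustive enumeration is only feasible for tiny graphs; it is used here purely to illustrate the identity, not as a practical solver.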



4 Spectral Energy Minimization 



Graph spectral partitioning is an effective approximation for graph partitioning
(Kannan et al., 2000), (Drineas et al., 1999), (Ding et al., 2001), (Cristianini
et al., 2002). Furthermore, studies using spectral methods for optimal dimension
embedding provide additional motivation, given their natural capability of
representing high-dimensional data in a low dimension (Belkin & Niyogi, 2002).

In this section, we discuss the spectral methods for minimizing the energy
function in Eqn. 7. First we notice that the energy minimization has a trivial
solution at $y$ with values all 1s or all $-1$s. We set $y'e = 0$ to restrict the
minimization to a balanced partitioning,

$$E = 2\left(\sum_i d_i y_i^2 - y'Hy\right) = 2\, y'Ly, \qquad (13)$$

where $L = D - H$ and $D = \mathrm{diag}(d_1, \ldots, d_n)$ with $d_i = \sum_j H_{ij}$.
Let the eigenvector of $L$ associated with the second smallest eigenvalue be denoted as
$y_2$, which is orthogonal to the eigenvector, $e$, corresponding to the smallest
eigenvalue, 0.

The normalized cut alternatively minimizes the following objective under
the constraint $y'De = 0$:

$$E = \frac{\sum_{i \in A,\, j \in B} H_{ij}}{\sum_{i \in A,\, j \in A \cup B} H_{ij}}
    + \frac{\sum_{i \in A,\, j \in B} H_{ij}}{\sum_{i \in B,\, j \in A \cup B} H_{ij}}
    \propto \frac{y'Ly}{y'Dy}.$$

And by solving the generalized eigen-system

$$(D - H)\, y = \lambda D y, \qquad (14)$$

the eigenvector corresponding to the second smallest eigenvalue, $y_2$, is used as in
the average cut case.

The algorithm for the spectral energy minimization is as follows: 



1. Construct the similarity matrix K with all samples

2. Construct the probability + similarity matrix H by adding the probability of
classification

3. Compute the k smallest eigenvalues and the associated eigenvectors by
solving the suitable eigenvalue problem

4. Perform classification in the k-dimensional subspace spanned by these k
eigenvectors using the labeled data
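The four steps can be sketched end to end on a toy problem. This is a minimal illustration under several assumptions that are not taken from the text: the similarity is $\exp(-\|x_i - x_j\|^2/\sigma^2)$, the class probability is a similarity-weighted vote over the labeled samples, only the second eigenvector of the unnormalized $L = D - H$ is used (k = 1), it is approximated by power iteration rather than a library eigen-solver, and step 4 assigns each point the label of the nearest labeled point in the embedding.

```python
import math

def rbf(x, z, sigma):
    return math.exp(-sum((u - v) ** 2 for u, v in zip(x, z)) / sigma ** 2)

def second_eigenvector(H, iters=500):
    """Approximate the eigenvector of L = D - H for the second smallest
    eigenvalue by power iteration on (c*I - L), projecting out the constant
    vector e at every step."""
    m = len(H)
    d = [sum(row) for row in H]
    c = 2.0 * max(d)                           # upper bound on eigenvalues of L
    y = [i - (m - 1) / 2.0 for i in range(m)]  # deterministic start, sums to 0
    for _ in range(iters):
        Ly = [d[i] * y[i] - sum(H[i][j] * y[j] for j in range(m)) for i in range(m)]
        y = [c * y[i] - Ly[i] for i in range(m)]
        mean = sum(y) / m
        y = [v - mean for v in y]              # enforce y'e = 0
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]              # enforce sum of squares = 1
    return y

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labeled = {0: +1, 3: -1}                       # index -> known label
sigma, n = 2.0, len(points)

# Step 1: similarity matrix K.
K = [[rbf(points[i], points[j], sigma) for j in range(n)] for i in range(n)]
# Step 2: class probabilities from the labeled samples, then the augmented H.
pos = [t for t, lab in labeled.items() if lab == +1]
p_pos = [sum(K[i][t] for t in pos) / sum(K[i][t] for t in labeled) for i in range(n)]
H = [[0.0] * (n + 2) for _ in range(n + 2)]
for i in range(n):
    H[i][:n] = K[i]
    H[i][n] = H[n][i] = 0.5 * p_pos[i]
    H[i][n + 1] = H[n + 1][i] = 0.5 * (1.0 - p_pos[i])
# Step 3: second eigenvector of L = D - H.
y = second_eigenvector(H)
# Step 4: label each point by the nearest labeled point in the 1-D embedding.
pred = [labeled[min(labeled, key=lambda t: abs(y[i] - y[t]))] for i in range(n)]
```

For larger problems a sparse eigen-solver and more than one eigenvector would normally replace the power iteration; the number of dimensions can then be chosen from the training error, as discussed next.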



In previous work in clustering, the choice of the number of eigenvectors is often
limited to observations and intuitions. There is considerable debate on whether
the second eigenvector is sufficient and whether using more eigenvectors gives
better clustering. However, in the case of semi-supervised learning, the errors
on the training samples can be used to select the number of dimensions to be
employed.

A simple classifier based on weighted distance constructed by weighting all 
training samples with exponential distances has been adopted. Assuming the 
similarity is constructed as in Eqn.8, the classifier’s output can be represented 
via the following 



$$H_{i,n+1} = \frac{\sum_t K_{it}\,(1 + y_t)}{2 \sum_t K_{it}} \qquad (15)$$



where $t$ indexes only the labeled data and $H_{i,n+2} = 1 - H_{i,n+1}$. Similarly, the
entries on the transpose are constructed. Although this classifier has relatively
poor performance on its own, the total energy minimization does achieve very
good performance.



5 Experiments 

The spectral method is tested on the following datasets: crab, Wisconsin breast
cancer, Pima Indian diabetes, vote, and ionosphere. For the spectral method,
the scale constant $\sigma$ in the radial basis function is set as the average value
of the eight nearest-neighbour distances. The balance constant $\lambda$ controlling
the fidelity to training labels and cluster compactness is set to a fixed value of
$2\pi$ for all datasets. The choice of the value $2\pi$ is arbitrary, as the algorithm
is not sensitive to the selection of the scale constant. All experiments are repeated
ten times and the average classification error on the unlabeled dataset is reported.
The crab data consists of two species, and each species is further broken down into
two sexes (Ripley, 1996). The classification of the species is relatively easy and
the distinction of the two sexes is more challenging. In this experiment, the task
is to classify the sex of the crab given a limited number of training samples.
5.1 Comparison with the Semi-supervised k-means Algorithm

Figure 1(a) shows the classification error of the spectral method compared with
the semi-supervised k-means for the crab dataset (Basu et al., 2002). It is clear
from the figure that the proposed spectral method is capable of learning from
both labeled and unlabeled data and performs better than both the constrained
k-means and the seeded k-means algorithms.







Fig. 1. Average Classification Error. Panels: (a) Crab, (b) Ionosphere,
(c) Breast Cancer, (d) Diabetes

Figure 1(b) shows the classification error of the spectral method compared
with the semi-supervised k-means for the ionosphere dataset. The spectral
method also performs very well on this dataset compared with the
semi-supervised k-means algorithm. Furthermore, compared with the semi-supervised
graph mincut’s accuracy of 81.6% using 50 training samples, the spectral
algorithm achieves an accuracy of 88.3% with 48 training samples (Blum & Chawla,
2001).

Figure 1(c) shows the classification error of the spectral method compared
with the semi-supervised k-means for the breast cancer dataset. The spectral
method also performs very well on this dataset compared with the
semi-supervised k-means algorithm. Although there are fluctuations in the
classification errors of the spectral algorithm, the range of the fluctuations is
within 0.5%, which is reasonable given the randomly drawn training samples. Figure 1(d)
shows the classification error of the spectral method compared with the
semi-supervised k-means for the diabetes dataset. All data with missing values are
removed from the dataset. The spectral method has larger errors when there is
a smaller number of training data, while the semi-supervised k-means does not
gain much performance with more training data.




Fig. 2. Average Classification Error c.f. TSVM. Panels: (a) Breast Cancer,
(b) Diabetes, (c) Ionosphere, (d) Votes



5.2 Comparison with the Transductive Support Vector Machine 

In this section, comparisons are performed with the transductive support vector
machine (TSVM) and support vector machine (Joachims, 1999). The tests with
the TSVM are difficult to perform, as there are different parameters to be
estimated and the algorithm does not always converge. Effort has been spent on
obtaining the best possible results using leave-one-out cross validation (LOOCV)
for selecting the parameters of the TSVM. For each set of training data, a pair of
$\sigma$ and $C$ that leads to the best LOOCV accuracy is obtained through grid search
on the region spanned by $\sigma \in \{2^i \mid i = -5, \ldots, 5\}$ and
$C \in \{2^i \mid i = -5, \ldots, 10\}$.
However, the computational requirement of such a process is an order of magnitude
larger than that of the spectral approach, and a lot of manual effort has been spent in
setting the parameters in the optimization.
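The size of the search grid gives a feel for the cost. The sketch below enumerates the grid described above and picks the best pair with a caller-supplied scoring function; `loocv_accuracy` and `toy_score` are hypothetical stand-ins, since training a TSVM is outside the scope of this illustration.

```python
from itertools import product

sigmas = [2.0 ** i for i in range(-5, 6)]    # 2^-5 ... 2^5
costs = [2.0 ** i for i in range(-5, 11)]    # 2^-5 ... 2^10

def grid_search(loocv_accuracy):
    """Return the (sigma, C) pair with the highest LOOCV accuracy.

    loocv_accuracy is a caller-supplied function (sigma, C) -> accuracy.
    """
    return max(product(sigmas, costs), key=lambda sc: loocv_accuracy(*sc))

# Stand-in scoring function (purely illustrative, peaks at sigma = 1, C = 4).
def toy_score(sigma, C):
    return -abs(sigma - 1.0) - abs(C - 4.0)

best_sigma, best_C = grid_search(toy_score)
n_grid_points = len(sigmas) * len(costs)     # each point needs a full LOOCV run
```

With 11 values of sigma and 16 values of C, every training set requires 176 LOOCV evaluations, each of which retrains the model once per left-out sample.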

Figure 2(a) to Figure 2(d) show the classification error of the spectral
method compared with SVM and TSVM. The spectral method with fixed
parameters and simple classifier probability estimation is able to achieve better
performance than SVM and comparable performance to TSVM on some datasets. It
is not surprising that the TSVM has in general superior performance, especially when
there are more training data.

6 Conclusion 

Posing semi-supervised learning as the optimization of both the classification
energy and the cluster compactness energy is a flexible representation in which
different classifiers and different similarity criteria can be adopted. The resulting
integer program can be relaxed by a semi-definite relaxation, which allows
efficient solutions to be obtained via spectral graph methods. The spectral energy
minimization method performs very well when compared with constrained
k-means and SVM. Although results from spectral energy minimization do not
exceed the performance of the best result achieved by the transductive SVM, it
provides a fast solution without difficult parameter estimation and convergence
problems.
Abstract. In this paper we present methods of enhancing existing
discriminative classifiers for multi-labeled predictions. Discriminative
methods like support vector machines perform very well for uni-labeled text
classification tasks. Multi-labeled classification is a harder task that has
received relatively less attention. In the multi-labeled setting, classes are often
related to each other or part of an is-a hierarchy. We present a new
technique for combining text features and features indicating relationships
between classes, which can be used with any discriminative algorithm.
We also present two enhancements to the margin of SVMs for building
better models in the presence of overlapping classes. We present results
of experiments on real-world text benchmark datasets. Our new methods
beat the accuracy of existing methods with statistically significant
improvements.



1 Introduction 

Text classification is the task of assigning documents to a pre-specified set of 
classes. Real world applications including spam filtering, e-mail routing, organi- 
zing web content into topical hierarchies, and news filtering rely on automatic 
means of classification. Text classification can be broadly categorized into discri- 
minative techniques, typified by support vector machines [1] (SVMs), decision 
trees [2] and neural networks; and generative techniques, like Naive Bayes (NB) 
and Expectation Maximization (EM) based methods. From a performance point 
of view, NB classifiers are known to be the fastest, learning a probabilistic ge- 
nerative model in just one pass of the training data. Their accuracy is however 
relatively modest. At the other end of the spectrum lie SVMs based on elegant 
foundations of statistical learning theory [3] . Their training time is quadratic to 
the number of training examples, but they are known to be the most accurate. 

The simplest task in text classification is to determine whether a document 
belongs to a class of interest or not. Most applications require the ability to 
classify documents into one out of many (> 2) classes. Often it is not sufficient 
to talk about a document belonging to a single class. Based on the granularity 
and coverage of the set of classes, a document is often about more than one 
topic. A document describing the politics involved in the sport of cricket, could 
be classified as Sports/Cricket, as well as Society/Politics. When a docu- 
ment can belong to more than one class, it is called multi-labeled. Multi-labeled 
classification is a harder problem than just choosing one out of many classes. 



H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 22-30, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 



Discriminative Methods for Multi-labeled Classification 






In this paper, we present algorithms which use existing discriminative classi- 
fication techniques as building blocks to perform better multilabeled classifica- 
tion. We propose two enhancements to existing discriminative methods. First, 
we present a new algorithm which exploits correlation between related classes in 
the label-sets of documents, by combining text features and information about 
relationships between classes by constructing a new kernel for SVMs with hete- 
rogeneous features. Next, we present two methods of improving the margin of 
SVMs for better multilabeled classification. We present experiments comparing 
various multilabeled classification methods. Following this, we review related 
work and conclude with future research directions. 



2 Multi-labeled Classification Using Discriminative 
Classifiers 

Suppose we are given a vector space representation of $n$ documents. In the
bag-of-words model, each document vector $d_i$ has a component for each term feature
which is proportional to its importance (term frequency or TFIDF are commonly
used). Each document vector is normalized to unit $L_2$ norm and is associated
with one of two labels, $+1$ or $-1$. The training data is thus
$\{(d_i, c_i), i = 1, \ldots, n\}$, $c_i \in \{-1, +1\}$.

A linear SVM finds a vector $w$ and a scalar constant $b$, such that for all $i$,
$c_i (w \cdot d_i + b) \ge 1$, and $\|w\|$ is minimized. This optimization corresponds
to fitting the thickest possible slab between the positive ($c = +1$) and negative
($c = -1$) documents.

Most discriminative classifiers, including SVMs, are essentially two-class
classifiers. A standard method of dealing with multi-class problems is to create an
ensemble of yes/no binary classifiers, one for each label. This method is called
one-vs-others. For each label $l_i$, the positive class includes all documents which
have $l_i$ as one of their labels and the negative side includes all other documents.
During application, the set of labels associated with a document $d_j$ is $\{k\}$, such
that $w_k \cdot d_j + b_k > 0$. This is the basic SVM method (denoted SVM) that serves
as a baseline against which we compare other methods.
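The one-vs-others prediction rule can be written down directly. The weights, biases and label names below are hypothetical; in practice they would come from one trained linear SVM per label.

```python
def predict_label_set(W, b, d):
    """Return the labels k whose score w_k . d + b_k is positive, plus all scores.

    W : dict mapping label -> weight vector (list of floats)
    b : dict mapping label -> bias
    d : document vector (list of floats, same length as each weight vector)
    """
    scores = {k: sum(wi * di for wi, di in zip(W[k], d)) + b[k] for k in W}
    return {k for k, s in scores.items() if s > 0}, scores

# Hypothetical weights for three labels over a two-term vocabulary.
W = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "finance": [-1.0, -1.0]}
b = {"sports": -0.5, "politics": -0.5, "finance": 0.0}

both, _ = predict_label_set(W, b, [0.6, 0.8])        # two classifiers fire
none, scores = predict_label_set(W, b, [0.3, 0.3])   # every score is negative
```

The second call shows that the predicted set can be empty when every classifier in the ensemble rejects the document.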

2.1 Limitations of the Basic SVM Method 

Text classification with SVMs faces one issue: that of all classifiers in an
ensemble rejecting instances. In one-vs-others, all constituents of the ensemble
emit a score $w_c \cdot d + b_c$; for multi-labeled classification we admit all classes
in the predicted set whose score $w_c \cdot d + b_c > 0$. However, in practice, we find
that a significant fraction of documents get negative scores from all the classifiers
in the ensemble.

Discriminative multi-class classification techniques, including SVMs, have hi- 
storically been developed to assign an instance to exactly one of a set of classes 
that are assumed to be disjoint. In contrast, multi-labeled data, by its very na- 
ture, consists of highly correlated and overlapping classes. For instance, in the 
Reuters-21578 dataset, there are classes like wheat-grain, crude-fuel, where one
class is almost a parent of the other class although this knowledge is not expli- 
citly available to the classifier. Such overlap among classes hurts the ability of 
discriminative methods to identify good boundaries for a class. We devise two 
techniques to handle this problem in Section 4. Correlation between classes can 
be a boon as well. We can exploit strong mutual information among subsets of 
classes to “pull up” some classes when the term information is insufficient. In 
the next section, we present a new method to directly exploit such correlation 
among classes to improve multi-label prediction. 



3 Combining Text and Class Membership Features 

The first opportunity for improving multi-labeled classification is provided by 
the co-occurrence relationships of classes in label sets of documents. We propose 
a new method for exploiting these relationships. 

If classification as class $C_i$ is a good indicator of classification as class $C_j$,
one way to enhance a purely text-based SVM learner is to augment the feature set
with $|C|$ extra features, one for each label in the dataset. The cyclic dependency
between features and labels is resolved iteratively.

Training: We first train a normal text-based SVM ensemble $S(0)$. Next,
we use $S(0)$ to augment each document $d \in D$ with a set of $|C|$ new columns
corresponding to scores $w_{c_i} \cdot d + b_{c_i}$ for each class $c_i \in C$. All
positive scores are transformed to $+1$ and all negative scores are transformed to
$-1$. In case all scores output by $S(0)$ are negative, the least negative score is
transformed to $+1$. The text features in the original document vector are scaled to
$f$ ($0 \le f \le 1$), and the new “label dimensions” are scaled to $(1 - f)$.
Documents in $D$ thus get a new vector representation with $|T| + |C|$ columns, where
$|T|$ is the number of term features. They also have a supervised set of labels. These
are now used to train a new SVM ensemble $S(1)$. We call this method SVMs with
heterogeneous feature kernels (denoted SVM-HF). The complete pseudo-code is shown in
Figure 1. This approach is directly related to our previous work on Cross-Training [4],
where label mappings between two different taxonomies help in building better
classification models for each of the taxonomies.

Testing: During application, all test documents are classified using $S(0)$. For
each document, the transformed scores are appended in the $|C|$ new columns with
appropriate scaling. These documents are then submitted to $S(1)$ to obtain the
final predicted set of labels.
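The feature augmentation step can be sketched as follows. The function names are illustrative, the scores stand in for the output of an already trained ensemble, a score of exactly zero is treated as negative, and the final renormalization of the combined vector is omitted; all of these are assumptions of the sketch.

```python
def label_features(scores):
    """Map first-stage scores to +1/-1; if none is positive, the least
    negative (i.e. largest) score is mapped to +1."""
    feats = [1.0 if s > 0 else -1.0 for s in scores]
    if all(s <= 0 for s in scores):
        feats[scores.index(max(scores))] = 1.0
    return feats

def augment(doc, scores, f):
    """Concatenate text features scaled by f with label features scaled by (1 - f)."""
    return [f * x for x in doc] + [(1.0 - f) * z for z in label_features(scores)]

doc = [0.6, 0.8]                                  # unit-norm term vector, |T| = 2
vec = augment(doc, [-0.2, -0.7, -0.4], f=0.75)    # |T| + |C| = 5 columns
```

The same transformation is applied to test documents before they are passed to the second-stage ensemble.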

The scaling factor: The differential scaling of term and feature dimensions
has special reasons. This applies a special kernel function to documents during
the training of $S(1)$. The kernel function in linear SVMs gives the similarity between
two document vectors, $K_T(d_i, d_j) = d_i \cdot d_j$. When document vectors are scaled
to unit $L_2$ norm, this becomes simply the $\cos\theta$ of the angle between the two
document vectors, a standard IR similarity measure. Scaling the term and label
dimensions sets up a new kernel function given by
$K(d_i, d_j) = f \cdot K_T(d_i, d_j) + (1 - f) \cdot K_L(d_i, d_j)$, where $K_T$ is the
usual dot product kernel between terms
1: Represent each document as a vector d in term space with ||d|| = 1
2: Build one-vs-rest SVM classifier S(0) using text tokens only
3: for each document d ∈ D do
4:    Apply S(0) to d, getting a vector γ_C(d) of |C| scores (see text)
5:    Concatenate vectors d and γ_C(d) into a single training vector with label
      carried over from S(0), with relative term-label weight determined by f,
      maintaining ||d|| = 1
6:    Add this vector into the training set of S(1)
7: end for
8: Induce a new one-vs-rest SVM classifier S(1) for all d ∈ D

Fig. 1. SVMs with heterogeneous feature kernels



and $K_L$ is the kernel between the label dimensions. The tunable parameter
$f$ is chosen through cross-validation on a held-out validation set. The label
dimensions interact with each other independent of the text dimensions in the
way we set up the modified kernel. Just scaling the document vector suitably is
sufficient to use this kernel and no change in code is needed.
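The claim that plain vector scaling is enough can be illustrated numerically. One caveat of this sketch: to reproduce the weighted sum $f \cdot K_T + (1 - f) \cdot K_L$ exactly with an ordinary dot product, the blocks are scaled by the square roots of the weights. That square-root choice is an assumption made here for the illustration and is not stated in the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combined_kernel(t1, l1, t2, l2, f):
    """K = f * K_T + (1 - f) * K_L with dot-product kernels on each block."""
    return f * dot(t1, t2) + (1.0 - f) * dot(l1, l2)

def scaled_vector(t, l, f):
    # Square-root scaling (an assumption of this sketch) makes the plain dot
    # product of two scaled vectors equal to combined_kernel.
    return [math.sqrt(f) * x for x in t] + [math.sqrt(1.0 - f) * z for z in l]

t1, l1 = [0.6, 0.8], [1.0, -1.0, -1.0]   # term block and label block of document 1
t2, l2 = [1.0, 0.0], [1.0, 1.0, -1.0]    # term block and label block of document 2
f = 0.7
k_direct = combined_kernel(t1, l1, t2, l2, f)
k_scaled = dot(scaled_vector(t1, l1, f), scaled_vector(t2, l2, f))
```

Because the combined kernel reduces to a dot product on rescaled vectors, any linear SVM implementation can be used unchanged.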

4 Improving the Margin of SVMs 

In multi-labeled classification tasks, the second opportunity for improvement 
is provided by tuning the margins of SVMs to account for overlapping classes. 
It is also likely that the label set attached to individual instances is
incomplete. Discriminative methods work best when classes are disjoint. In our
experience with the Reuters-21578 dataset, multi-labeled instances often seem 
to have incomplete label sets. Thus multi-labeled data are best treated as ‘par- 
tially labeled’. Therefore, it is likely that the ‘others’ set includes instances that 
truly belong to the positive class also. We propose two mechanisms of removing 
examples from the large negative set which are very similar to the positive set. 
The first method does this at the document level, the second at the class level. 

4.1 Removing a Band of Points around the Hyperplane 

For each classifier in an SVM ensemble, the presence of very similar negative 
training instances on the ‘others’ side hampers the margin and orients the 
separating hyperplane slightly differently than if these points were absent. 
If we remove the points which are very close to the resulting hyperplane, we 
can train a better hyperplane with a wider margin. The algorithm to do this 
consists of two iterations: 

1. In the first iteration, train the basic SVM ensemble. 

2. For each SVM trained, remove those negative training instances which are 
within a threshold distance (band) from the learnt hyperplane. Re-train the 
ensemble. 

We call this method the band-removal method (denoted BandSVM). When 
selecting this band, we have to be careful not to remove instances that are 
crucial in defining the boundary of the ‘others’ class. We use a held-out 
validation dataset to choose the band size. An appropriate band size strikes a 
balance between large-margin separation, achieved by removing highly related 
points, and over-generalization, caused by removing points that truly belong to 
the negative class. 
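
The band-removal step between the two iterations can be sketched as follows. 
This is only an illustration under stated assumptions, not the implementation 
used in the experiments: the classifier is assumed to be linear and given by a 
weight vector w and bias b, the band is measured as geometric distance from the 
hyperplane, and the names signed_distance, remove_band and band are 
hypothetical. 

```python
import math

def signed_distance(w, b, x):
    """Signed geometric distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def remove_band(w, b, examples, band):
    """Drop negative examples lying within `band` of the hyperplane.

    `examples` is a list of (x, label) pairs with label +1 or -1.
    Positive examples are always kept; only the 'others' side is pruned.
    """
    kept = []
    for x, label in examples:
        if label == -1 and abs(signed_distance(w, b, x)) < band:
            continue  # negative instance inside the band: remove it
        kept.append((x, label))
    return kept

# Toy hyperplane from a first iteration (assumed): x1 = 0.
w, b = (1.0, 0.0), 0.0
train = [((2.0, 0.0), +1), ((0.1, 1.0), +1),
         ((-0.2, 0.5), -1), ((-3.0, 1.0), -1)]
pruned = remove_band(w, b, train, band=0.5)
# The second iteration would re-train the classifier on `pruned`.
```

In this toy example only the negative point at distance 0.2 from the hyperplane 
is removed; the band value itself would be chosen on held-out validation data, 
as described above. 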



4.2 Confusion Matrix Based “Others” Pruning 

Another way of countering very similar positive and negative instances is to 
completely remove all training instances of ‘confusing’ classes. Confusing 
classes are detected using a confusion matrix quickly learnt over held-out 
validation data using any moderately accurate yet fast classifier such as naive 
Bayes [5]. The confusion matrix for an n-class problem is an n × n matrix M, 
where the entry M_ij gives the percentage of documents of class i which were 
misclassified as class j. If M_ij is above a threshold β, we prune away all 
confusing classes (like j) from the ‘others’ side of i when constructing the 
i-vs-others classifier. This method is called the confusion-matrix-based 
pruning method (denoted ConfMat). This two-step method is specified as: 

1. Obtain a confusion matrix M over the original learning problem using any 
fast, moderately accurate classifier. Select a threshold β. 

2. Construct a one-vs-others SVM ensemble. For each class i, leave out the 
entire class j from the ‘others’ set if M_ij > β. 

If the parameter β is very small, many classes will be excluded from the 
‘others’ set. If it is too large, none of the classes may be excluded, 
resulting in the original ensemble. β is chosen by cross-validation. 
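
The two steps above can be sketched as follows. This is a minimal illustration 
with hypothetical names (confusing_classes, others_set) and a toy confusion 
matrix. How a multi-labeled document carrying a pruned class is handled is an 
assumption of the sketch: here such a document is left out of the ‘others’ set 
entirely. 

```python
def confusing_classes(M, beta):
    """For each class i, list the classes j to drop from the 'others' set.

    M[i][j] is the percentage of validation documents of class i that were
    misclassified as class j; beta is the pruning threshold.
    """
    n = len(M)
    return {i: [j for j in range(n) if j != i and M[i][j] > beta]
            for i in range(n)}

def others_set(i, labeled_docs, pruned):
    """Negative training documents for the i-vs-others classifier.

    `labeled_docs` is a list of (doc, set_of_class_ids) pairs. Documents of
    class i are positives; documents carrying any pruned class are left out.
    """
    drop = set(pruned[i])
    return [doc for doc, labels in labeled_docs
            if i not in labels and not (labels & drop)]

# Toy 3-class confusion matrix (percentages, rows = true class).
M = [[0.0, 12.0, 1.0],
     [9.0, 0.0, 2.0],
     [0.5, 1.5, 0.0]]
pruned = confusing_classes(M, beta=5.0)
docs = [("d1", {0}), ("d2", {1}), ("d3", {2}), ("d4", {1, 2})]
negatives_for_0 = others_set(0, docs, pruned)
```

With β = 5, classes 0 and 1 are mutually confusing, so the 0-vs-others 
classifier is trained without any document labeled with class 1. 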

ConfMat is faster to train than BandSVM: it relies on a confusion matrix 
given by a fast naive Bayes classifier and requires only one SVM ensemble to be 
trained. The user’s domain knowledge about relationships between classes (e.g. 
hierarchies of classes) can be easily incorporated in ConfMat. 

5 Experiments 

We describe experiments with text classification benchmark datasets and report 
the results of a comparison between the various multi-labeled classification 
methods. We compare the baseline SVM method with ConfMat, BandSVM, and 
SVM-HF. 

All experiments were performed on a 2-processor 1.3GHz P3 machine with 
2GB RAM, running Debian Linux. Rainbow¹ was used for feature and text 
processing and SVMLight² was used for all SVM experiments. 

¹ http://www.cs.cmu.edu/~mccallum/bow/ 

² http://svmlight.joachims.org 






5.1 Datasets 

Reuters-21578: The Reuters-21578 Text Categorization Test Collection is a 
standard text categorization benchmark. We use the Mod-Apte split and evaluate 
all methods on the given train/test split with 135 classes. To test statistical 
significance, we also separately use random 70-30 train/test splits (averaged 
over 10 random splits) for a subset of 30 classes. For feature selection, we 
applied stemming and stopword removal, considered only tokens which occurred in 
more than one document, and selected the top 1000 features by mutual 
information. 

Patents: The Patents dataset is another text classification benchmark. We 
used the WIPO-alpha collection, which is an English-language collection of 
patent applications classified into a hierarchy of classes with subclasses and 
groups. We take all 114 sub-classes of the top level (A to H) using the given 
train/test split. We also report the average over 10 random 70-30 train/test 
splits for the F sub-hierarchy. We consider only the text in the abstract of 
the patent for classification, and feature selection is the same as that for 
the Reuters dataset. 



5.2 Evaluation Measures 

All evaluation measures discussed are on a per-instance basis and the aggregate 
value is an average over all instances. For each document $d_j$, let T be the 
true set of labels and S be the predicted set of labels. Accuracy is measured 
by the Hamming score, which symmetrically measures how close T is to S. Thus, 
$\mathrm{Accuracy}(d_j) = |T \cap S| / |T \cup S|$. The standard IR measures of 
Precision (P), Recall (R) and $F_1$ are defined in the multi-labeled 
classification setting as $P(d_j) = |T \cap S| / |S|$, 
$R(d_j) = |T \cap S| / |T|$, and 
$F_1(d_j) = 2 P(d_j) R(d_j) / (P(d_j) + R(d_j))$. 
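
As a worked illustration of these per-instance measures, the following sketch 
computes them for label sets represented as Python sets. The function names are 
hypothetical, and the conventions for empty sets (for example, precision 0 when 
S is empty) are assumptions of the sketch rather than part of the definitions 
above. 

```python
def instance_scores(true_labels, predicted_labels):
    """Per-instance accuracy (Hamming score), precision, recall, and F1."""
    T, S = set(true_labels), set(predicted_labels)
    common = len(T & S)
    accuracy = common / len(T | S) if T | S else 1.0
    precision = common / len(S) if S else 0.0
    recall = common / len(T) if T else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f1

def average_scores(pairs):
    """Average each measure over all (T, S) pairs, one pair per document."""
    rows = [instance_scores(T, S) for T, S in pairs]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# One document labeled {grain, wheat} and predicted as {grain, corn}:
# |T n S| = 1, |T u S| = 3, |S| = 2, |T| = 2.
acc, p, r, f1 = instance_scores({"grain", "wheat"}, {"grain", "corn"})
```

For this document the accuracy is 1/3, while precision, recall and F1 are all 
1/2. 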



5.3 Overall Comparison 

Figures 2 and 3 show the overall comparison of the various methods on the 
Reuters and Patents datasets. Figure 2 shows the comparison on all 135 classes 
of Reuters as well as the results of averaging over 10 random train/test splits 
on a subset of 30 classes. Figure 3 shows the comparison for all 114 subclasses 
of Patents and the average over 10 random train/test splits on the F class 
sub-hierarchy. For both datasets we see that SVM-HF has the best overall 
accuracy. SVM has the best precision and ConfMat has the best recall. We also 
observe that BandSVM and SVM-HF are very comparable for all measures. 

We did a directional t-test of statistical significance between the SVM and 
SVM-HF methods for the 30 class subset and the F sub-hierarchy. The accuracy 
and F1 scores of SVM-HF were 2% better than SVM, a small but significant 
difference at the 95% level of significance. The t values were 2.07 and 2.02 
respectively, above the minimum required value of 1.73 for df = 18. 

                    30 class subset                       All 135 classes
Method    Accuracy  Precision  Recall  F1        Accuracy  Precision  Recall  F1
SVM       82.02     92.65      82.47   87.26     81.26     92.41      82.45   87.15
ConfMat   76.16     81.64      88.00   84.7      80.92     87         88.37   87.68
BandSVM   83.18     89.87      87.41   88.63     81.73     88.44      87.54   87.99
SVM-HF    84.25     91.56      86.94   89.19     82        88.66      87.27   87.96

Fig. 2. The Reuters-21578 dataset 





                  F class sub-hierarchy                   All 114 subclasses
Method    Accuracy  Precision  Recall  F1        Accuracy  Precision  Recall  F1
SVM       66.65     73.65      67.57   70.48     42.47     56.76      43.37   49.16
ConfMat   66.62     69.70      70.63   70.16     41.67     53.40      51.65   52.51
BandSVM   67.30     72.09      68.90   70.45     43.30     55.24      48.61   51.70
SVM-HF    68.86     72.06      69.78   70.90     44.41     55.35      49.84   52.45

Fig. 3. The Patents dataset 



5.4 Interpreting Coefficients 

With all documents scaled to unit L2 norm, inspecting the components of w along 
the label dimensions derived by SVM-HF gives us some interesting insights into 
various kinds of mappings between the labels. The signed components of w along 
the label dimensions represent the amount of positive or negative influence 
each dimension has in classifying documents. As an example for the Reuters 
dataset, the label dimension for grain (+8.13) is highly indicative of the 
class grain. Wheat (+1.08) also has a high positive component for grain, while 
money-fx (−0.98) and sugar (−1.51) have relatively high negative components. 
This indicates that a document getting classified as wheat is a positive 
indicator of the class grain, and a document classified as sugar or money-fx is 
a negative indicator of the class grain. 



5.5 Comparing Number of Labels 

Figure 4 shows the size of the true set of labels T and of the predicted set S. 
We fix |S| to be 1, 2, 3 for each |T| = 1, 2, 3. For instance, for |T| = 1, 
|S| = 1 for 99% of the instances for the SVM method, and only 1% of the 
instances are assigned |S| = 2. For singleton labels, SVM is precise and admits 
only one label, whereas the other methods admit a few extra labels. 

When |T| = 2, 3, we see that SVM still tends to give fewer predictions, often 
just one, compared to the other methods, which have a high percentage of 
instances in the |T| = |S| column. One reason for this is the way 
one-vs-others is resolved. All negative scores in one-vs-others are resolved by 
choosing the least negative score and treating this as positive. This forces 
the prediction set size to be 1, and the semantics of the least negative score 
is unclear. The percentage of documents assigned all negative scores by SVM is 
18% for 30 classes of Reuters, while ConfMat, BandSVM, and SVM-HF assign all 
negative scores to only 4.94%, 6.24%, and 10% of documents respectively. 

                  T=1                   T=2                   T=3
Method     S=1    S=2    S=3     S=1    S=2    S=3     S=1    S=2    S=3
SVM        0.99   0.01   0       0.5    0.5    0       0.52   0.35   0.13
ConfMat    0.83   0.14   0.03    0.27   0.63   0.1     0.17   0.3    0.48
BandSVM    0.89   0.09   0.01    0.32   0.64   0.03    0.22   0.3    0.43
SVM-HF     0.94   0.06   0.01    0.34   0.63   0.02    0.3    0.26   0.39

Fig. 4. Percentage of instances with various sizes of S for T=1,2,3 with 30 
classes of Reuters. Here, 68% of all test instances in the dataset had T=1; 
22% had T=2; 8% had T=3; others had T greater than 3. 
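
The one-vs-others resolution rule discussed in Section 5.5 can be sketched as 
follows. This is an illustration only: the function name predict_label_set is 
hypothetical, and a decision threshold of zero on the per-class scores is 
assumed. 

```python
def predict_label_set(scores):
    """Resolve one-vs-others scores into a predicted label set S.

    `scores` maps each class to its classifier score. Classes with a
    positive score are predicted; if every score is negative, the least
    negative class is returned alone, so |S| is forced to 1.
    """
    positive = {c for c, s in scores.items() if s > 0}
    if positive:
        return positive
    return {max(scores, key=scores.get)}

# Two classes score positive, so both are predicted.
two_labels = predict_label_set({"grain": 0.8, "wheat": 0.3, "sugar": -1.2})
# All scores are negative: only the least negative class is kept.
forced_one = predict_label_set({"grain": -0.4, "wheat": -0.1, "sugar": -1.2})
```

The second call shows why a classifier that assigns all negative scores to many 
documents ends up with a large share of singleton predictions. 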



6 Related Work 

Limited work has been done in the area of multi-labeled classification. Crammer 
et al. [6] propose a one-vs-others-like family of online topic ranking 
algorithms. Ranking is given by w_c · x, where the model w_c for each class is 
learnt similarly to perceptrons, with an update of w_c in each iteration 
depending on how imperfect the ranking is compared to the true set of labels. 
Another kernel method for multi-labeled classification, tested on a gene 
dataset, is given by Elisseeff et al. [7]. They propose an SVM-like formulation 
giving a ranking function along with a set size predictor. Both these methods 
are topic ranking methods, trying to improve the ranking of all topics. We 
ignore ranking of irrelevant labels and try to improve the quality of SVM 
models for automatically predicting labels. The ideas of exploiting correlation 
between related classes and improving the margin for multi-label classification 
are unique to our paper. 

Positive Example Based Learning (PEBL) [8] is a semi-supervised learning 
method similar to BandSVM. It also uses the idea of removing selected negative 
instances. A disjunctive rule is learned on features of strongly positive 
instances. SVMs are iteratively trained to refine the positive class by 
selectively removing negative instances. The goal in PEBL is to learn from a 
small positive and a large unlabeled pool of examples, which is different from 
multi-labeled classification. 

Multi-labeled classification has also been attempted using generative models, 
although discriminative methods are known to be more accurate. McCallum [9] 
gives a generative model where each document is probabilistically generated 
by all topics represented as a mixture model trained using EM. The class sets 
which can generate each document are exponential in number, and a few 
heuristics are required to efficiently search only a subset of the class space. 
The Aspect model [10] is another generative model which could naturally be 
employed for multi-labeled classification, though no such work currently 
exists. Documents are probabilistically generated by a set of topics, and words 
in each document are generated by members of this topic set. This model is, 
however, used for unsupervised clustering and not for supervised 
classification. 

7 Conclusions 

We have presented methods for discriminative multi-labeled classification. We 
have presented a new method (SVM-HF) for exploiting co-occurrence of classes 
in label sets of documents using iterative SVMs and a general kernel function 
for heterogeneous features. We have also presented methods for improving the 
margin quality of SVMs (BandSVM and ConfMat). We see that SVM-HF performs 
2% better in terms of accuracy and F1 than the basic SVM method, a small but 
statistically significant difference. We also note that SVM-HF and BandSVM are 
very comparable in their results, being better than ConfMat and SVM. ConfMat 
has the best recall, giving the largest size of the predicted set; this could 
help a human labeler in the data creation process by suggesting a set of 
closely related labels. 

In future work, we would like to explore using SVMs with the positive set 
containing more than one class. The composition of this positive set of related 
candidate classes is as yet unexplored. Secondly, we would like to theoretically 
understand the reasons for accuracy improvement in SVM-HF given that there is 
no extra information beyond terms and linear combinations of terms. Why should 
the learner pay attention to these features if all the information is already present 
in the pure text features? We would also like to explore using these methods in 
other application domains. 

Acknowledgments. The first author is supported by the Infosys Fellowship 
Award from Infosys Technologies Limited, Bangalore, India. We are grateful to 
Soumen Chakrabarti for many helpful discussions, insights and comments. 

Abstract. Clustering large, high dimensional spatial data sets with noises is 
a difficult challenge in data mining. In this paper, we present a new subspace 
clustering method, called SCI (Subspace Clustering based on Information), to 
solve this problem. SCI combines Shannon information with grid-based and 
density-based clustering techniques. Designing a clustering algorithm is 
equivalent to constructing an equivalence relation among data points. 
Therefore, we propose an equivalence relation, named density-connected, to 
identify the main bodies of clusters. For the purpose of noise detection and 
cluster boundary discovery, we also use the grid approach to devise a new 
cohesive mechanism to merge border data points into clusters and to filter out 
the noises. However, the curse of dimensionality is a well-known serious 
problem when using the grid approach on high dimensional data sets, because 
the number of grid cells grows exponentially with the number of dimensions. To 
strike a compromise between randomness and structure, we propose an automatic 
method for attribute selection based on Shannon information. With the merit of 
requiring only one data scan, algorithm SCI is very efficient, with its run 
time being linear in the size of the input data set. As shown by our 
experimental results, SCI is very powerful in discovering clusters of arbitrary 
shape. 

Index Terms: Data clustering, subspace clustering, Shannon information 



1 Introduction 

Data mining has attracted a significant amount of research attention as a means 
of discovering useful knowledge and information. Data clustering is an 
important technique in data mining, with the aim of grouping data points into 
meaningful subclasses. It has been widely used in many applications such as 
data analysis, pattern recognition, and image processing. 

Spatial data possesses much information related to space. Increasingly large 
amounts of data are obtained from satellite images, medical equipment, 
multimedia databases, etc. Therefore, automated knowledge discovery in spatial 
data sets becomes very important. It is known that most existing clustering 
algorithms, such as k-means, k-medoids, single-link and complete-link, have 
several drawbacks when applied to large, high dimensional spatial data sets 
[10]. 

The grid approach partitions the data space into disjoint grid cells. There 
is a high probability that all data points within the same grid cell belong to 
the same cluster. Therefore, all data points that fall into the same cell can 
be aggregated and treated as one object. All of the clustering operations, 
which are performed on the grid structure, can be improved in terms of the 
processing time [1] [9]. Note that the total number of grid cells is 
independent of the number of data points. On the other hand, each cluster has a 
typical density of data points that is higher than that of the points outside 
the cluster [6]. In addition, there are some spatial relationships among the 
data points of spatial data sets. The data points inside a neighborhood of a 
given radius around a certain point should be more similar to each other than 
to those outside the neighborhood. Hence, the density-based approach can 
capture the main bodies of clusters [9] [14]. Another benefit of these 
approaches is independence of the input order of the data points, because the 
merging of cells into clusters depends only on their spatial relationships and 
densities. Some clustering algorithms, such as Grid clustering [13], STING 
[15], CLIQUE [1], and WaveCluster [14], have been proposed to integrate the 
density-based and grid-based clustering techniques. However, there still exist 
some problems to solve, such as the overfitting phenomenon, the confusion 
between noises (or outliers) and border points of clusters, and the curse of 
dimensionality, to name a few. 

In most related grid-based and density-based clustering algorithms, such as 
CLIQUE and STING, two grid cells u_1, u_2 are connected if they have a 
common face or if there exists another grid cell u_3 such that u_1 is connected 
to u_3 and u_2 is connected to u_3 [1]. A cluster is defined as a maximal set 
of connected dense grid cells. However, those methods suffer from some 
problems. The first is the overfitting problem. These methods are likely to 
divide some meaningful clusters into many subclusters, especially when the 
direction along which the data points spread is not parallel to the axes. For 
example, consider the scenario of a 2-dimensional data set with one cluster, as 
shown in Fig. 1(a). The number of clusters output in this case is 3 when using 
CLIQUE or STING. The result is deemed unsuccessful since the phenomenon of 
overfitting occurs. To remedy this, we shall redefine the meaning of the 
connection between two grid cells in our work, as will be described in detail 
later. 

The second problem is the detection of the boundaries of clusters and noises. 
The prior grid-based and density-based clustering algorithms are designed in 
particular to deal with dense grid cells. Sometimes the points on a cluster’s 
boundary may not fall into any dense cell, as in the example shown in Fig. 
1(b). Hence, we cannot separate the cluster boundaries from noises. Moreover, 
most of the detected cluster boundaries are parallel or perpendicular to the 
axes, and some border points may be lost. We therefore design a new cohesive 
mechanism, based on the grid structure, to detect the boundaries of clusters 
and to filter out the outliers. In addition, the grid approach is vulnerable to 
high dimensional data sets, because the number of grid cells grows 
exponentially with the number of dimensions. We offer a new approach to solve 
this problem by using Shannon information. The Shannon information, also 
referred to as entropy, is a measure of uncertainty [2]. Note that structure 
diminishes uncertainty and randomness confuses the borders. Clustering 
algorithms usually group data points into clusters in such a way that the 
structural relationship between points assigned to the same cluster tends to be 
stronger than that between points in different clusters. The data points within 
the same cluster should exhibit less randomness, and should have a weaker 
structural relationship to the data points in other clusters. Therefore, an 
attribute is considered redundant if its entropy is too large or too small. 



Fig. 1. (a) The overfitting phenomenon by CLIQUE or STING, (b) the boundary is 
difficult to detect. 



To strike a compromise between randomness and structure, we propose an 
automatic method for attribute selection based on Shannon information. The 
proposed subspace clustering method is referred to as algorithm SCI, standing 
for Subspace Clustering based on Information. With the merit of requiring only 
one data scan, algorithm SCI is very efficient, with its run time being linear 
in the size of the input data set. In addition, SCI has several advantages, 
including being scalable in the number of attributes, robust to noises, and 
independent of the order of input. As shown by our experimental results, SCI is 
very powerful in discovering clusters of arbitrary shape. In a specific 
comparison with algorithms CLIQUE, CURE, and k-means, algorithm SCI is able to 
achieve the best clustering quality while incurring even shorter execution 
time. 

The rest of the paper is organized as follows. Algorithm SCI is developed in 
Section 2. In Section 3, we present some experimental results. The conclusion is 
given in Section 4. 



2 Subspace Clustering Based on Information 

As mentioned before, some conventional grid-based and density-based clustering 
algorithms are likely to have the following disadvantages: (1) the overfitting 
phenomenon, (2) the confusion between noises (or outliers) and border points of 
clusters, and (3) the curse of dimensionality. To solve those problems, we 
propose a new efficient and effective subspace clustering method for large, 
high dimensional data sets with noises, called SCI (Subspace Clustering based 
on Information). In essence, algorithm SCI is a hierarchical, single-scan 
subspace clustering algorithm. A large dataset is compressed into a highly 
condensed table. Each cell of this table just stores the count of data points 
in the corresponding grid cell of the data space. The main memory only stores 
this data table (called the summary table) and the data points of sparse cells. 
Such an implementation enables this clustering algorithm to be effective and 
scalable. 



2.1 Definitions 

Given a d-dimensional data set D, our main task is to partition this set 
into meaningful subclasses. Assume that the spatial data set D has d attributes 
$(X_1, X_2, \ldots, X_d)$, and each attribute domain is bounded and totally 
ordered. Without loss of generality, we assume that the domain of attribute 
$X_j$ is a half-open interval $[L_j, U_j)$ for some numbers $L_j < U_j$, 
$j = 1, 2, \ldots, d$. Therefore, the d-dimensional numerical space 
$X = [L_1, U_1) \times [L_2, U_2) \times \cdots \times [L_d, U_d)$ denotes 
a data space. 

Definition 1. The k-regular partition is a partition which divides the data 
space X into disjoint rectangular cells. The cells are obtained by partitioning 
every attribute domain into k half-open intervals of equal length. For 
simplicity, we let such intervals be right-open. 

The k-regular partition divides the data space of the d-dimensional spatial 
data set D into $k^d$ cells of the same volume. The domain of each attribute 
$X_i$ is divided into bins of length $\delta_i = (U_i - L_i)/k$, so it is the 
union of the k disjoint right-open intervals 
$I_{i,j} = [L_i + (j-1)\delta_i,\ L_i + j\delta_i)$, $j = 1, \ldots, k$. Each 
grid cell has a unique index. The $(i_1, i_2, \ldots, i_d)$th cell, 
$B(i_1, i_2, \ldots, i_d)$, is 
$I_{1,i_1} \times I_{2,i_2} \times \cdots \times I_{d,i_d}$. 
Since each cell has the same volume, the number of data points inside it can be 
used to approximate the density of the cell. For ease of exposition, we 
henceforth assume that the data space of the data set D has been partitioned 
into disjoint equal-volume grid cells by using the k-regular partition. 
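
As an illustration of the k-regular partition, the following sketch maps a data 
point to the 1-based index of the grid cell that contains it. The function name 
cell_index and the argument names are hypothetical; the bounds lower[j] and 
upper[j] play the roles of L_j and U_j. 

```python
def cell_index(point, lower, upper, k):
    """Return the 1-based grid index (i_1, ..., i_d) of `point`.

    `lower[j]` and `upper[j]` bound attribute j on the half-open interval
    [L_j, U_j); each domain is split into k right-open bins of equal length.
    """
    index = []
    for x, lo, hi in zip(point, lower, upper):
        if not lo <= x < hi:
            raise ValueError("point lies outside the data space")
        delta = (hi - lo) / k          # bin length for this attribute
        i = int((x - lo) / delta) + 1  # bin i covers [lo+(i-1)*delta, lo+i*delta)
        index.append(min(i, k))        # guard against floating-point round-up
    return tuple(index)

# A 2-dimensional data space [0, 1) x [0, 10) with k = 4 bins per attribute.
idx = cell_index((0.25, 7.5), lower=(0.0, 0.0), upper=(1.0, 10.0), k=4)
```

Because the intervals are right-open, the value 0.25 falls into the second bin 
[0.25, 0.5) of the first attribute, and 7.5 falls into the fourth bin 
[7.5, 10) of the second, giving the cell B(2, 4). 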

Definition 2. The neighboring cells of a cell B are the cells that are adjacent 
to B. The cell B_i neighbors B_j if B_j is one of the neighboring cells of B_i. 

The set of neighboring cells of B contains two classes: (1) those that have a 
common boundary side with B and (2) those that are adjacent to B at a single 
point. Each grid cell has 2 × 2^d neighboring cells, except for the cells on 
the boundary of the data space. 

Definition 3. Let m_sup be the predetermined density threshold. A grid cell is 
called a (k, m_sup)-dense cell if it contains at least m_sup data points. 
Otherwise, we call that cell a (k, m_sup)-sparse cell. 



Definition 4. Each (k, m_sup)-dense cell is density-connected with itself. Two 
(k, m_sup)-dense cells B_i and B_j are density-connected if they are neighbors 
of each other or if there exists another (k, m_sup)-dense cell B_l such that 
B_i is density-connected to B_l, and B_l is density-connected to B_j. 

The equivalence relation in [7] and the density-connected relationship lead 
to the following theorem. 

Theorem 1. The density-connected relationship is an equivalence relation on 
the set of dense cells. 

An equivalence relation on a set induces a partition on it, and any partition 
also induces an equivalence relation [7]. The main work of a clustering 
algorithm is to assign each data point to exactly one of the clusters. Hence, 
implementing a clustering algorithm is equivalent to constructing a partition 
function. Since the density-connected relationship is an equivalence relation, 
we can use it to partition the set of dense cells into disjoint subclasses. In 
addition, the dense cells will cover most points of the data set, and the use 
of the density-connected relationship helps to classify and identify the main 
body of each cluster. The remaining work is hence to discover the boundaries of 
clusters. 
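
A minimal sketch of how the density-connected relationship partitions the dense 
cells is given below. It is not the authors' implementation: the names are 
hypothetical, the summary table is assumed to be a dictionary from cell index 
to point count, and every cell whose index differs by at most one in each 
dimension is assumed to count as a neighbor. 

```python
from collections import deque
from itertools import product

def neighbors(cell):
    """All cells whose index differs from `cell` by at most 1 per dimension."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            yield tuple(c + o for c, o in zip(cell, offset))

def density_connected_classes(counts, m_sup):
    """Partition the dense cells into density-connected classes.

    `counts` maps a cell index to its number of data points; a cell is
    dense if it holds at least `m_sup` points.
    """
    dense = {cell for cell, n in counts.items() if n >= m_sup}
    seen, classes = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:                  # breadth-first search over dense cells
            cell = queue.popleft()
            component.append(cell)
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        classes.append(sorted(component))
    return classes

# Three dense cells along a diagonal, one distant dense cell, one sparse cell.
counts = {(1, 1): 9, (2, 2): 7, (3, 3): 8, (6, 6): 12, (4, 4): 1}
classes = density_connected_classes(counts, m_sup=5)
```

The three diagonal cells touch only at single points, yet they end up in one 
class, which is the behaviour the density-connected definition is meant to 
provide for clusters that are not aligned with the axes. 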

Definition 5. A (k, m_sup)-sparse cell is called isolated if its neighboring 
cells are all (k, m_sup)-sparse cells. The data points which are contained in 
isolated sparse cells are defined as noises. 

A proper density-connected equivalence class usually contains most data 
points of each cluster. The data points of a sparse cell may be either boundary 
points of some cluster or outliers. Whether a sparse cell is isolated or not is 
the key to distinguishing border points from outliers. Outliers are essentially 
the points of isolated sparse cells. The data points in non-isolated sparse 
cells are border points of clusters. They might belong to distinct clusters 
but lie inside the same sparse cell. We therefore link the data points of a 
non-isolated sparse cell to their nearest dense neighbor cells, point by point: 
each data point of a non-isolated sparse cell is assigned to the cluster which 
contains its nearest dense neighbor cell. To this end, we have to define the 
nearest neighbor cell of a data point. 

If $v_j$, $j = 1, 2, \ldots, 2^d$, are the vertices of grid cell B, then 
$c_B = \frac{1}{2^d} \sum_{j=1}^{2^d} v_j$ 
is its geometrical center point. Note that the center point of a grid cell may 
not be the distribution center (mean) of the data points in this cell. 

Definition 6. Given a distance metric d on data points, the distance function 
between a data point x and a cell B is defined as the distance between x and 
the geometrical center point of B, i.e. $d_B(x, B) = d(x, c_B)$, where $c_B$ is 
the geometrical center point of B. The nearest neighbor cell of a data point x 
is the cell whose distance to x is the minimum among all neighbor cells of x. 
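
Definitions 5 and 6 can be illustrated together with the following sketch, 
which links a point of a sparse cell to its nearest dense neighbor cell, or 
reports that the cell is isolated. The names are hypothetical, the Euclidean 
metric is assumed for d, and neighbors are taken to be the cells whose index 
differs by at most one in each dimension. 

```python
import math
from itertools import product

def cell_center(cell, lower, upper, k):
    """Geometric center of a 1-based grid cell under the k-regular partition."""
    return tuple(lo + (i - 0.5) * (hi - lo) / k
                 for i, lo, hi in zip(cell, lower, upper))

def nearest_dense_neighbor(x, cell, dense, lower, upper, k):
    """Nearest dense neighbor cell of point x, which lies in `cell`.

    Returns None when no neighboring cell is dense, i.e. the sparse cell
    is isolated and x is treated as noise.
    """
    best, best_dist = None, math.inf
    for offset in product((-1, 0, 1), repeat=len(cell)):
        nb = tuple(c + o for c, o in zip(cell, offset))
        if not any(offset) or nb not in dense:
            continue
        dist = math.dist(x, cell_center(nb, lower, upper, k))
        if dist < best_dist:
            best, best_dist = nb, dist
    return best

lower, upper, k = (0.0, 0.0), (4.0, 4.0), 4   # unit-sized cells
dense = {(1, 2), (3, 2)}
# x lies in the sparse cell (2, 2), closer to the dense cell on its right.
target = nearest_dense_neighbor((1.9, 1.5), (2, 2), dense, lower, upper, k)
# The sparse cell (1, 4) has no dense neighbor, so its points are noise.
noise = nearest_dense_neighbor((0.5, 3.5), (1, 4), {(3, 2)}, lower, upper, k)
```

Here the border point is linked to cell (3, 2), whose center is at distance 0.6, 
rather than to cell (1, 2), whose center is at distance 1.4. 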

2.2 Algorithm SCI 

Formally, algorithm SCI consists of the following four procedures: 

1. Partition: Create the regular grid structure. 

2. Counting: Count the number of data points in each grid cell. 

3. Purify: Remove the redundant attributes. 

4. GD clustering: Identify the clusters. 

Partition Stage: The partition process partitions the data space into cells 
using the k-regular partition. For each data point, by utilizing the attribute 
domain and the value of k, we can obtain the index of the corresponding grid 
cell. 

Table 1. Algorithm SCI 

Algorithm: SCI 
Inputs: m_sup, k, I* 

1. Partition: Partition the data space, using the k-regular partition, into 
   disjoint rectangular cells. 

2. Counting: for each $i_j = 1, \ldots, k$; $j = 1, \ldots, d$ 
   a. Count the number of data points that are contained in the cell 
      $B(i_1, i_2, \ldots, i_d)$. 
   b. Assume that $B(i_1, i_2, \ldots, i_d)$ has $N_{i_1 i_2 \cdots i_d}$ data 
      points. Mark $B(i_1, i_2, \ldots, i_d)$ as a (k, m_sup)-dense cell if 
      $N_{i_1 i_2 \cdots i_d} \ge$ m_sup; otherwise mark it as a 
      (k, m_sup)-sparse cell. 

3. Purify: 
   a. Compute the information of each attribute. 
   b. Prune the attribute $X_j$ if $\mathrm{Info}(X_j) > \log_2 k - I^*$ or 
      $\mathrm{Info}(X_j) < I^*$. 

4. GD clustering: 
   4.1 Noises and boundary detection: 
       a. Link the data points of non-isolated (k, m_sup)-sparse cells to their 
          nearest dense neighbor. 
       b. Mark the data points of isolated (k, m_sup)-sparse cells as outliers. 
   4.2 Clustering: Find clusters. 



Counting Stage: In the counting process, we scan the dataset once and count 
the number of data points in each cell. A large dataset is compressed into a 
highly condensed summary table. Each cell of this table just stores the count 
of data points in the corresponding grid cell of the data space. We classify 
these grid cells of the data space as either dense or sparse, and further 
classify sparse cells as either isolated or non-isolated. 

Purify Stage: Note that we are interested in automatically identifying a 
subspace of a high dimensional data space that allows better clustering of the 
data points than the original space. By only using the summary table of the 
data set, we can avoid multiple database scans. Let N be the number of input 
data points. After the k-regular partition process, the data space will be 
partitioned into $k^d$ grid cells 
$\{B(i_1, i_2, \ldots, i_d) \mid i_j = 1, \ldots, k;\ j = 1, \ldots, d\}$. 
For each j, the domain of attribute $X_j$ is divided into k disjoint 
equal-width intervals $\{I_{j,i} \mid i = 1, \ldots, k\}$. If the grid cell 
$B(i_1, i_2, \ldots, i_d)$ has $N_{i_1 i_2 \cdots i_d}$ data points, then 
$C_{ji} = \sum_{i_j = i} N_{i_1 i_2 \cdots i_d}$ data points are projected on 
the interval $I_{j,i}$ under the $X_j$ dimension. (Note that the value of 
$C_{ji}$ can also be counted at scanning time.) 

The entropy relationship [2] leads to the following useful lemma for our work. 

Lemma 1. a. The $X_j$ dimension has Shannon information 
$\mathrm{Info}(X_j) = -\sum_{i=1}^{k} \frac{C_{ji}}{N} \log_2 \frac{C_{ji}}{N}$ 
bits. 

b. $0 \le \mathrm{Info}(X_j) \le \log_2 k$. 

c. $\mathrm{Info}(X_j) = 0 \iff \exists\, l$ such that $C_{jl}/N = 1$ and 
$C_{jr}/N = 0$ $\forall r \ne l$. 

d. $\mathrm{Info}(X_j) = \log_2 k \iff C_{ji}/N = 1/k$ for all i. 

If the data points projected on some attribute dimension are spatially close to 
one another, then we cannot significantly discern among them statistically. 
This attribute holds less information. If we prune this attribute and cluster 
again in the lower dimensional space, the clustering result will not change 
much. On the other hand, if the distribution of the data points projected on 
some attribute dimension is uniform, then this attribute has more information 
but less group structure. Therefore, such an attribute is also useless for 
clustering and is subject to pruning. Explicitly, recall that structure 
diminishes uncertainty and randomness confuses the borders of clusters. The 
Shannon information is a commonly used measure of uncertainty. Attributes of 
less Shannon information carry less boundary information among groups, but more 
group structure information. In contrast, attributes of more Shannon 
information carry more boundary information, but possess less group structure 
information. We can therefore prune those attributes whose Shannon information 
is either too large or too small. 

Example 1. In Fig. 2(a), the distribution of data points projected on dimension 
X is compact, and dimension X hence has less Shannon information. In Fig. 2(b), 
the projection of data points on dimension X is uniform, and dimension X then 
has more Shannon information, which, however, implies less group structure 
information. In both scenarios, the subspace with dimension X pruned allows 
better clustering of the data points than the original space. 

To decide which subspace is interesting, we apply Lemma 1 to our work.
Assume that $I^*$ is a predetermined small positive threshold (called the
significance level). We then prune the attribute $X_j$ if
$Info(X_j) > \log_2 k - I^*$ or $Info(X_j) < I^*$.
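
As a minimal sketch of this pruning rule (not the authors' implementation), the following Python fragment computes the Shannon information of one dimension from its k interval counts and applies the two-sided threshold test. The function names, the example counts, and the threshold value 0.1 are illustrative assumptions.

```python
import math


def shannon_info(counts):
    """Shannon information (in bits) of one dimension from its k interval counts."""
    total = sum(counts)
    info = 0.0
    for c in counts:
        if c > 0:  # the term 0 * log2(0) is taken to be 0
            p = c / total
            info -= p * math.log2(p)
    return info


def should_prune(counts, threshold):
    """Prune when the information lies within `threshold` of either extreme."""
    k = len(counts)
    info = shannon_info(counts)
    return info < threshold or info > math.log2(k) - threshold


# Hypothetical interval counts for three dimensions with k = 4 intervals each.
compact = [100, 0, 0, 0]    # all points fall into one interval
uniform = [25, 25, 25, 25]  # points spread evenly over the intervals
grouped = [45, 5, 45, 5]    # two dense groups separated by sparse intervals

for name, counts in [("compact", compact), ("uniform", uniform), ("grouped", grouped)]:
    print(name, round(shannon_info(counts), 3), should_prune(counts, 0.1))
```

Under these assumed counts, the compact and the uniform dimensions are pruned, while the dimension with two dense groups is kept.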

GD Clustering Stage: The GD clustering stage consists of the following two
sub-processes:

1. Noise and boundary detection: Classify the clustering boundary points and
noise.

2. Clustering: Find clusters. 

We categorize the data points of a cluster into two kinds: points inside
dense cells and points inside sparse cells. The core points of a cluster are
the points inside a set of density-connected cells. Some border points may
fall in a sparse cell near the core points of the cluster. A cluster is
a maximal set of density-connected cells together with the linked points of
their sparse neighbors. The main task of the outlier detection process is to
classify the data points of sparse cells into border points or outliers. The
final process is to determine the maximal density-connected equivalence classes
by merging the dense neighbors. Each equivalence class, together with its
linked border points, is a cluster.

Note that the centroid of a cluster may lie outside the cluster. An illustrative
example is shown in Fig. 2(c). Similarly, the centroid of the region of a
density-connected equivalence class may lie outside the region. If we merge
border points into the density-connected equivalence class whose centroid is
nearest, we are likely to make incorrect merges sometimes. To solve this
problem, we merge border points into their nearest neighbor cell. Moreover, to
avoid computing cell center points, we can merge border points into the
neighboring cell with the nearest boundary distance without compromising the
quality of results.

Fig. 2. (a) Dimension X has little Shannon information, (b) dimension X has much
Shannon information but less group structure information, (c) the centroid of
the cluster is outside the cluster.

3 Experimental Evaluation 

To assess the performance of algorithm SCI, we have conducted a series of
experiments. We compare the cluster quality and the execution time of SCI with
several well-known clustering methods, including the CLIQUE, CURE, and k-means
algorithms. The synthetic sample data sets of our experiments are shown in Fig.
3. We added some random noise to evaluate the noise detection mechanism.
For algorithms CLIQUE and SCI, the granularity of partition k is set to 25 and
the density threshold msup is 1.1 times the expected number of data points in
each grid cell under a uniform distribution, i.e. 1.1 x (N/25^d). The clustering
results of those algorithms are shown at http://arbor.ntu.edu.tw/~ming/sci/ where
algorithm SCI is able to successfully group these data sets. Further, the noise
detection mechanism has been automatically triggered under algorithm SCI.
These results show that the CURE and k-means algorithms are not able to capture
the spatial shapes of clusters successfully. In fact, only algorithm SCI is able
to detect noise successfully. Algorithm CLIQUE tends to remove too many points
as noise. Because algorithm CLIQUE is not equipped with any detection mechanism
to separate noise from the sparse cells, some border points are classified as
noise. Also, parts of the cluster boundaries of CLIQUE are parallel or
perpendicular to the axes. As shown in Fig. 4(b) and Fig. 4(d), there are
overfitting phenomena in the resulting clusters of CLIQUE. These results show
that SCI is effective in handling sophisticated spatial data sets.

The average execution times for those data sets are shown in Table 2, where
it can be seen that the grid-based and density-based approaches, i.e. CLIQUE and
SCI, are more efficient than the others. With the refinement of the definition of
connected neighbor for grid cells, SCI is able to attain better clustering results
than CLIQUE, as well as to improve the search for connected components. From
these experiments, it is noted that most of the execution time of SCI is spent on
noise and boundary detection, which indeed leads to better results.



Fig. 3. Data sets used in experiments: (a) Data Set 1, (b) Data Set 2,
(c) Data Set 3, (d) Data Set 4.




Fig. 4. Some clustering results of SCI and CLIQUE. 



Table 2. Execution time (in seconds) for the clustering algorithms.
(Note that CLIQUE does not have outlier detection.)

                 Data Set 1   Data Set 2   Data Set 3   Data Set 4
Number of data       12,607       14,348       16,220       11,518
SCI                    0.87         1.04         0.92         0.83
CLIQUE                 0.33         0.28         0.38         0.33
CURE                 487.08       624.29       797.35       406.05
k-means                1.15         1.93         0.94         2.75



4 Conclusions 

In this paper, we developed algorithm SCI, which integrates Shannon information
with grid-based and density-based clustering to form a good solution for subspace
clustering of high-dimensional spatial data with noise. Requiring only one data
scan, algorithm SCI is very efficient, with its run time linear in the size of
the input data set. In addition, SCI was shown to have several advantages,
including being scalable to the number of attributes, robust to noise, and
independent of the order of input. As shown by our experimental results, SCI is
able to discover clusters of arbitrary shape, to solve the overfitting problem,
and to avoid the confusion between outliers (noise) and cluster boundary points.

Abstract. This paper proposes a two-step graph partitioning method to discover
constrained clusters with an objective function that follows the well-known 
min-max clustering principle. Compared with traditional approaches, the 
proposed method has several advantages. Firstly, the objective function not 
only follows the theoretical min-max principle but also reflects certain practical 
requirements. Secondly, a new constraint is introduced and solved to suit more 
application needs while unconstrained methods can only control the number of 
produced clusters. Thirdly, the proposed method is general and can be used to 
solve other practical constraints. The experimental studies on word grouping 
and result visualization show very encouraging results. 



1 Introduction 

As a widely recognized technique for data analysis, clustering aims at gathering 
closely related entities together in order to identify coherent groups, i.e., clusters. 
Clustering methods have proven to be very useful in many application areas including 
data mining, image processing, graph drawing, and distributed computing. This paper 
presents a novel graph theoretic partitioning approach to constrained clustering, 
analyzes and demonstrates the advantages of such an approach. 

For constrained clustering, grouping similar units into clusters has to satisfy some 
additional conditions. Such additional conditions come from two kinds of knowledge: 
background knowledge and user requirements. While some works have investigated
the use of background knowledge in the clustering process, little research has
analyzed in depth the role of user input in the process of clustering.
According to the role of user input in the clustering process, clustering
criteria can be classified into two categories: user-centric and data-centric.
The former involves and solves practical constraints in the clustering process,
while the latter meets only application-independent requirements such as high
cohesiveness, low coupling, and low noise. A practical clustering method should
be both user-centric and data-centric; e.g., in most clustering algorithms the
number of clusters to be discovered is a necessary user input, while the
application-independent min-max clustering principle must hold: the similarity
between two clusters is significantly less than the similarity within each
cluster [3].

The number of user-input constraints varies for different applications. In spatial
data clustering, user input can be minimized because there are naturally defined
potential clusters contained in the given data, and finding these clusters is
exactly the final purpose. In many other cases, however, finding natural clusters
is far from the final goal, so meeting only one constraint on the number of
clusters is not enough. Let us use a simple example in Fig. 1 to demonstrate the
necessity of involving more constraints.




Fig. 1. (a) A population-density map with 12 communities, (b) A 4-clustering of the 12 
communities, (c) A 4-clustering with a constraint on the distance between cluster centers. 



Fig. 1 (a) is a map that describes the population distribution in a city. The
denser the color, the more crowded the area. A builder is planning to build some
supermarkets in this city. He wants to choose four profitable locations for the
supermarkets according to this map. So he uses a clustering algorithm to
discover the clusters of people in order to put his supermarkets at the centers of
the clusters. In Fig. 1 (a), we can see that a correct and “good” data-centric clustering
algorithm without any user-input would produce the twelve communities in this city 
as twelve clusters. Such a clustering result, unfortunately, is of little use because the 
builder can afford only four supermarkets. Fig. 1 (b) illustrates a feasible result after 
accepting the constraint on the number of clusters: the four stars indicate the best 
locations of the four supermarkets. The result shown in Fig. 1 (b), however, cannot 
satisfy the builder either. The builder needs a bigger distance between two
supermarkets so that they can cover more area. After accepting one more constraint
on the minimum allowable distance between two supermarkets, Fig. 1 (c) shows a
desirable result. Fig. 1 reveals an important phenomenon in many clustering
applications: clustering results that satisfy only the constraint on the number
of clusters may not meet the practical requirements. It is necessary to involve
more constraints in the clustering process.

This paper introduces a new constraint into the generic clustering problem: the 
upper bound of the quantified similarity between two clusters. The similarity may 
have different quantified values in different application domains. In the above 
supermarket example, the similarity between two clusters is represented by the 
distance between two clusters: the nearer, the more similar. The upper bound of the 
similarity between two clusters is the minimum allowable distance between their 
centers. We believe that the upper bound of similarity is a general constraint which is 
required by many clustering applications. To support the argument, we further
provide an example on parallel task partitioning. Since computers in a
distributed environment lack a global address space, communication has to be
inserted whenever a processor needs to access non-local data. For example, on the
Intel Paragon the processor cycle time is 20 nanoseconds whereas the remote memory
access time is between 10000 and 30000 nanoseconds [8], depending on the distance
between communicating processors. Therefore, it is imperative that the frequency and
volume of non-local accesses are reduced as much as possible. Suppose that the whole
computing task is composed of n smaller tasks. Given k machines/processors, the
attempt to assign the n small tasks to the k machines/processors is a k-clustering
problem, which has several concerns. Firstly, there is a communication bottleneck
between the machines/processors. Secondly, it may not be worthwhile to parallelize
the computing task when the ratio of the communication cost to the total cost
exceeds a preset threshold. Thirdly, in most cases, the number of available
machines/processors is not fixed but flexible. The three concerns can be mapped
exactly onto three corresponding constraints in the clustering problem: the upper
bound constraint, the min-max principle, and the number of clusters.

Motivated by the above examples, this paper proposes a graph theoretic model 
that can represent the above three requirements (two user-input constraints and one 
application-independent min-max principle): 1) the desired number of clusters; 2) the 
objective function of clustering that reflects the min-max principle, and 3) the upper 
bound of the similarity between two clusters. In particular, the desired number of
clusters in the proposed model is represented with a range $(K_{min}, K_{max})$, the
minimum and maximum allowable number of clusters, respectively. The objective function is
defined as the ratio of the similarity within the cluster to the similarity between 
clusters. Maximizing the proposed objective function not only follows the min-max 
principle but also meets certain practical requirements, e.g., in parallel task 
partitioning, the ratio decides if it is worth to parallelize the computation task; in 
spatial data clustering, the objective function is the ratio of the density inside the 
cluster to the density outside the cluster when representing the spatial data sets as 
sparse graphs. Based on the three requirements, this paper proposes a two-step graph 
partitioning methodology to discover constrained clusters. The basic idea involves 
node sequencing and then partitioning the node sequence according to the constraints. 
In the sequencing process, the graph nodes are sorted using existing algorithms. In the 
partitioning process we find an appropriate cut point according to the objective
function along the ordered node sequence, so that all points on one side are output
as a cluster while all points on the other side remain for further partitioning,
until all constraints are satisfied or the number of produced clusters exceeds $K_{max}$.

The rest of this paper is organized as follows. Related work on graph partitioning 
and constrained clustering is introduced in Section 2. Section 3 proposes our two-step 
methodology. The experimental studies on word grouping and result visualization are 
presented in Section 4. Section 5 concludes the paper. 



2 Related Work 

Since this paper focuses on a constrained graph partitioning for data clustering, the 
related work can be categorized into two parts: graph partitioning and constraint- 
based data clustering. 

2.1 Graph Partitioning

Numerous graph partitioning criteria and methods have been reported in the literature.
We consider matrix transformation and ordering, since our method uses similar
techniques. Finding the optimal solution to the graph partitioning problem is
NP-complete due to the combinatorial nature of the problem [3, 4]. The objective of
graph partitioning is to minimize the cut size [3], i.e., the similarity between the
two subgraphs, with the requirement that the two subgraphs have the same number of
nodes. It is important to note that minimizing the cut size at each partitioning step
cannot guarantee a minimal upper bound on the cut size for the whole clustering
process.

Hagen and Kahng [6] remove the requirement on the sizes of the subgraphs and
show that the Fiedler vector provides a good linear search order for the ratio cut
(Rcut) partitioning criterion, which is proposed by Cheng and Wei [2]. The definition
of Rcut is $Rcut = cut(A,B)/|A| + cut(A,B)/|B|$, where G=(V,E) is a weighted graph
with node set V and edge set E, cut(A,B) is defined as the similarity between the two
subgraphs A and B, and |A|, |B| denote the sizes of A and B, respectively.

Shi and Malik [10] propose the normalized cut by utilizing the advantages of the
normalized Laplacian matrix: $Ncut = cut(A,B)/deg(A) + cut(A,B)/deg(B)$, where deg(A)
is the sum of node degrees, which is also called the volume of subgraph A, in contrast
to the size of A. Ding et al. [3] propose a min-max cut algorithm for graph partitioning
and data clustering with a new objective function called
$Mcut = cut(A,B)/W(A) + cut(A,B)/W(B)$, where W(A) is defined as the sum of the
weights of the edges belonging to subgraph A.
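
To make the three criteria concrete, the following Python sketch evaluates Rcut, Ncut, and Mcut for a given bipartition of a small graph. It is only an illustration: the dict-of-dicts graph representation, the helper names, and the example graph (two triangles joined by one edge) are assumptions, and W(A) is computed by counting each internal edge once, following the wording above.

```python
def build_graph(edges):
    """Symmetric dict-of-dicts adjacency from (u, v, weight) triples."""
    weights = {}
    for u, v, w in edges:
        weights.setdefault(u, {})[v] = w
        weights.setdefault(v, {})[u] = w
    return weights


def cut_criteria(weights, part_a):
    """Return Rcut, Ncut and Mcut for the bipartition (part_a, remaining nodes)."""
    part_b = set(weights) - set(part_a)
    # cut(A, B): total weight of edges with one end in A and the other in B.
    cut = sum(w for u in part_a for v, w in weights[u].items() if v in part_b)

    def inner(part):
        # W(.): total weight of edges inside the part, each edge counted once.
        return sum(w for u in part for v, w in weights[u].items() if v in part) / 2

    def degree(part):
        # deg(.): sum of the weighted degrees of the nodes in the part.
        return sum(sum(weights[u].values()) for u in part)

    return {
        "Rcut": cut / len(part_a) + cut / len(part_b),
        "Ncut": cut / degree(part_a) + cut / degree(part_b),
        "Mcut": cut / inner(part_a) + cut / inner(part_b),
    }


# Two unit-weight triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
graph = build_graph([(0, 1, 1), (0, 2, 1), (1, 2, 1),
                     (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 1)])
scores = cut_criteria(graph, {0, 1, 2})
print(scores)
```

The sketch assumes both parts contain at least one internal edge; otherwise the Mcut denominator would be zero.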

All the objective functions above are designed to help their algorithms find an
appropriate partitioning. They are algorithm-oriented and do not reflect practical
requirements or physical meanings, which makes them unsuitable for constrained
clustering.

2.2 Constraint-Based Data Clustering

According to Tung et al., constraint-based clustering [11] is defined as follows. Given
a data set D with n objects, a distance function $df: D \times D \rightarrow R$, a
positive integer k, and a set of constraints C, find a k-clustering
$(Cl_1, Cl_2, \ldots, Cl_k)$ such that

$DISP = \sum_{i=1}^{k} disp(Cl_i, rep_i)$

is minimized, and each cluster $Cl_i$ satisfies the constraints C, denoted as
$Cl_i \models C$, where $disp(Cl_i, rep_i)$ measures the total distance between each
object in $Cl_i$ and the representative point $rep_i$ of $Cl_i$. The representative of
a cluster $Cl_i$ is chosen such that $disp(Cl_i, rep_i)$ is minimized. There are two
kinds of constraint-based clustering methods, aiming at different goals: one category
aims at increasing the efficiency of the clustering algorithm, while the other attempts
to incorporate domain knowledge using constraints. Two instance-level constraints,
must-link and cannot-link, have been introduced by Wagstaff and Cardie [12], who have
shown that the two constraints can be incorporated into COBWEB [5] to increase the
clustering accuracy while decreasing runtime. Bradley et al. propose a constraint-based
k-means algorithm [1] to avoid local solutions with empty clusters or clusters with very
few data points, which can often be seen when the value of k is bigger than 20 and the
number of dimensions is bigger than 10.
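
The DISP objective defined at the start of this subsection can be illustrated with a short Python sketch. It assumes, for simplicity, that the representative of a cluster is restricted to the cluster's own members (a medoid); the definition itself does not impose this restriction, and the function names and the one-dimensional example data are hypothetical.

```python
def disp(cluster, dist):
    """Return (dispersion, representative) for one cluster.

    The representative is the member whose total distance to all members
    is smallest (an assumed restriction to cluster members).
    """
    return min((sum(dist(x, rep) for x in cluster), rep) for rep in cluster)


def total_disp(clusters, dist):
    """DISP: sum of the per-cluster dispersions."""
    return sum(disp(cluster, dist)[0] for cluster in clusters)


def absolute_difference(a, b):
    return abs(a - b)


# Hypothetical 2-clustering of five one-dimensional objects.
clusters = [[1, 2, 6], [10, 11]]
print(disp(clusters[0], absolute_difference))
print(total_disp(clusters, absolute_difference))
```

A constraint check would be applied on top of this objective; it is omitted here because the constraint set C is application specific.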

The proposed method is different from the aforementioned approaches. Firstly, we
combine graph partitioning and constraint-based clustering. Secondly, the
proposed partitioning process is different from traditional partitioning in that
each partitioning step produces only one cluster instead of two subgraphs. The
remaining part will be further partitioned until all constraints are satisfied or the
number of produced clusters exceeds $K_{max}$. Thirdly, the upper bound constraint we
introduce is new, and its involvement aims not at improving the efficiency of the
clustering process but at encoding the user’s requirements. Finally, we accept
both unweighted and weighted graphs. Section 3 will introduce the proposed two-step
methodology.



3 A Two-Step Methodology 

This section first describes the process of node sequencing, and then introduces the 
node partitioning method. 

3.1 Node Sequencing Method (NSM) 

The process of node sequencing transforms a two-dimensional graph into a one- 
dimensional node sequence. The sequencing method used here is proposed in our 
previous paper [9]. Due to the space limitation, we only provide a brief introduction. 
The whole sequencing includes two steps: coarse-grained sequencing, which 
partitions the given graph into several parts and sorts these parts, and fine-grained 
sequencing, which sorts the nodes inside each part produced in the coarse-grained 
step. As a result, we obtain a sequence of nodes in which the nodes belonging to the 
same cluster will be put together. Then we can apply the node partitioning method to 
the node sequence to find the boundary points of clusters with constraints. 

3.2 Node Partitioning Method (NPM) 

This section proposes a novel node partitioning method. We first introduce the 
algorithm parameters used in the algorithm. 

3.2.1 Algorithm Parameters 

The algorithm parameters used in NPM include alpha1, alpha2, beta, and $E_{inter}$,
which are defined as follows. Given a node sequence S of n nodes, a cut at the ith
node separates S into two sub-sequences, say $S_1$ and $S_2$, where $S_1$ contains the
nodes from 1 to i, and $S_2$ contains the nodes from i+1 to n. Let $E_1$ denote the
number of edges inside $S_1$, $E_2$ the number of edges inside $S_2$, and $E_{inter}$
the number of edges between $S_1$ and $S_2$. We have the following algorithm
parameters, as intuitively shown in Fig. 2.



$alpha1(i) = E_1 / (i(i-1)/2)$    (1)

$alpha2(i) = E_2 / ((n-i)(n-i-1)/2)$    (2)

$beta(i) = E_{inter} / (((n-i) \cdot i)/2)$    (3)

As Fig. 2 shows, the big square represents an adjacency matrix of the given graph,
and the cut at i separates the node sequence 1...n into 1...i and i+1...n. alpha1
represents the density of the upper left square, alpha2 represents the density of
the lower right square, and beta represents the density of the upper right or lower
left square.

Given a node sequence S of n nodes, for every node i, 1<i<n, “cut at i” means
partitioning the sequence at the ith node. A cut at i is acceptable only if its
corresponding objective function cutvalue(i) is a peak value. That cutvalue(i) is a
peak value means that cutvalue(i) is bigger than any other cutvalue between i-ε and
i+ε, where ε is a threshold defined as n/2k and k is the number of desired clusters.
Cutvalue measures the ratio of the similarity within the cluster to the similarity
between clusters and helps discover where we should separate the node sequence. Its
definition is:

$cutvalue(i) = \begin{cases} alpha1(i)/beta(i), & beta(i) \neq 0 \\ MAX + alpha1(i), & beta(i) = 0 \end{cases}$    (4)

The MAX in formula (4) is a very big constant used to distinguish the nodes when
beta(i) is zero. According to the definitions, beta(i) represents the density of the
inter-cluster area while alpha1(i) is the density of the intra-cluster area. The
physical meaning of using the ratio of alpha1(i) to beta(i) is to reflect the relative
density. If cutvalue increases significantly, the corresponding cut point is more
likely to be located at the boundary of two clusters.
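
The parameters and the cutvalue can be sketched in Python for an unweighted graph whose nodes are identified by their 1-based position in the node sequence. This is a hedged illustration rather than the authors' code: the function names and the value chosen for MAX are assumptions, and the denominator of beta is taken to be ((n-i)·i)/2, which is an assumed reading of formula (3). The formulas are only defined for 2 ≤ i ≤ n-2.

```python
MAX = 1e9  # stand-in for the "very big constant" of formula (4); the value is an assumption


def cut_parameters(edges, n, i):
    """Return (alpha1, alpha2, beta) for a cut after the i-th node, 2 <= i <= n-2.

    edges: iterable of (u, v) pairs of 1-based sequence positions.
    """
    e1 = e2 = e_inter = 0
    for u, v in edges:
        if u <= i and v <= i:
            e1 += 1          # edge inside S1
        elif u > i and v > i:
            e2 += 1          # edge inside S2
        else:
            e_inter += 1     # edge between S1 and S2
    alpha1 = e1 / (i * (i - 1) / 2)
    alpha2 = e2 / ((n - i) * (n - i - 1) / 2)
    beta = e_inter / ((n - i) * i / 2)
    return alpha1, alpha2, beta


def cutvalue(edges, n, i):
    """Formula (4): ratio of intra-cluster to inter-cluster density."""
    alpha1, _, beta = cut_parameters(edges, n, i)
    return alpha1 / beta if beta != 0 else MAX + alpha1


# Two triangles {1, 2, 3} and {4, 5, 6} joined by the edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
for i in range(2, 5):
    print(i, cutvalue(edges, 6, i))
```

In this small example the cutvalue is largest at i = 3, the position that separates the two triangles.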

Now let us define the upper bound constraint for the similarity between two
clusters. For clusters $C_i$ and $C_j$, and nodes $u \in C_i$, $v \in C_j$, if
$(u,v) \in E$, we say (u,v) is an edge between clusters $C_i$ and $C_j$. Let
inter(i,j) denote the number of edges between clusters $C_i$ and $C_j$, and
sum_inter(i,j) the sum of the weights of the edges between clusters $C_i$ and $C_j$.
We have the following definitions:

Definition 3.1 (Coupling Bound, Bound Constraint)

Coupling Bound is the biggest number of inter-cluster edges for a clustering result,
represented by an integer $U_{inter}$: $U_{inter} = \max(inter(i,j))$,
$\forall i,j \in \{1,2,\ldots,k\}$, $i \neq j$. For weighted graphs, coupling bound is
the maximal sum of the weights of inter-cluster edges, denoted by an integer
$U_{inter\text{-}w}$: $U_{inter\text{-}w} = \max(sum\_inter(i,j))$,
$\forall i,j \in \{1,2,\ldots,k\}$, $i \neq j$. Bound Constraint, denoted by U, is a
user-input constraint on the maximum allowable coupling bound. For a satisfactory
clustering result, the formula $U_{inter} \le U$ must hold (for weighted graphs, the
formula is $U_{inter\text{-}w} \le U$).

Definition 3.2 (Granularity, G-constraint)

Granularity is defined as the number of clusters for a clustering result, denoted by
an integer k. G-constraint is a pair of integers $(K_{min}, K_{max})$ input by the
user. For a satisfactory clustering result, the formula $K_{max} \ge k \ge K_{min}$
must hold.
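
The two definitions can be checked mechanically. The Python sketch below computes the coupling bound of an unweighted clustering and tests both constraints; the function names, the integer cluster labels, and the example data are assumptions, and the bound constraint is taken to be satisfied when the coupling bound does not exceed U.

```python
def coupling_bound(edges, cluster_of):
    """Largest number of edges between any pair of distinct clusters."""
    inter = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            pair = (min(cu, cv), max(cu, cv))  # unordered pair of cluster labels
            inter[pair] = inter.get(pair, 0) + 1
    return max(inter.values(), default=0)


def is_satisfactory(edges, cluster_of, bound, k_min, k_max):
    """True when both the bound constraint and the G-constraint hold."""
    k = len(set(cluster_of.values()))  # granularity of the clustering
    return coupling_bound(edges, cluster_of) <= bound and k_min <= k <= k_max


# Two triangles joined by one edge, clustered into their natural groups.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
cluster_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(coupling_bound(edges, cluster_of))
print(is_satisfactory(edges, cluster_of, bound=1, k_min=2, k_max=3))
```

For weighted graphs, the same structure applies with edge weights accumulated instead of edge counts.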



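Fig. 2. The physical meanings of the algorithm parameters. In the adjacency matrix
with rows and columns ordered 1...n and a cut at i, the upper left block corresponds
to alpha1(i), the lower right block to alpha2(i), and the two off-diagonal blocks to
beta(i).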



Constraint-Based Graph Clustering through Node Sequencing and Partitioning 47 



Fig. 3 describes the algorithm that computes the cutvalue for each cut at i and
returns the cut with the first peak value, as well as the whole NPM algorithm.



4 Experimental Studies 

Our experimental studies consist of three parts. The first part evaluates the
min-max principle and demonstrates that the parameter values represent the data
distribution precisely. The second part evaluates the ability of our approach on
constraint satisfaction. Since there are two kinds of constraints involved in our
algorithm, the upper bound constraint and the range of the desired number of
clusters, the second part uses a set of synthetic constraints to evaluate whether our
algorithm can find the constrained clustering results for the given data set. The
third part visualizes the clustering results intuitively and compares our results
with the one produced by Kamada and Kawai’s method, a well-known force-directed
(spring) graph clustering algorithm.

The clustering task in our experiments is word grouping, a basic technique of text
mining and document clustering. The testing data are generated from subsets of an
English dictionary. In the first part of our experiments, two data sets, DS1 and DS2,
are used. In the second part of our experiments, three data sets, DS3, DS4, and DS5,
are used. The properties of their corresponding graphs are shown in Table 1. The
similarity between two words is defined on their edit distance, except for DS1. Each
word is regarded as a graph node, and if the edit distance between two words is less
than a fixed threshold, there is an edge between the two corresponding nodes. For
DS1, there is an edge between two nodes if and only if the two corresponding
words have the same length.
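
The graph construction for the edit-distance data sets can be sketched as follows. The code is an illustrative assumption, not the program used in the experiments: it uses the standard Levenshtein distance, treats the comparison with the threshold as strict (following the wording above), and the word list is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_graph(words, threshold):
    """Edges between all word pairs whose edit distance is below `threshold`."""
    return [(u, v)
            for idx, u in enumerate(words)
            for v in words[idx + 1:]
            if edit_distance(u, v) < threshold]


words = ["cat", "cot", "dog"]  # hypothetical dictionary subset
print(word_graph(words, 2))
```

This pairwise construction is quadratic in the number of words, which is acceptable for data sets of the sizes listed in Table 1 but would need indexing for much larger dictionaries.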



Algorithm ComputeCut (Graph g, Integer n)
begin
  for each i from 2 to n-2
    compute alpha1(i), beta(i), and cutvalue(i) according to (4);
  for each node i from 1 to n do
    cut ← getFirstPeak(cutvalue[i]);
  return cut;
end

Algorithm Npm (NodeSequence seq) {seq is the result after node sequencing}
begin
  t ← 0; i ← 0;
  while (i < Kmin) do
  begin {the partitioning procedure}
    Remove the nodes before Node[t] from seq;
    n ← number of nodes remaining in seq;
    Create the residual graph g;
    t ← ComputeCut(g, n); i ← i+1;
  end
  while ((Uinter > U) && (i < Kmax)) do
    repeat the partitioning procedure;
  if (Uinter ≤ U) return clustering result;
  else return (“No such kind of clustering”);
end

Fig. 3. The ComputeCut Algorithm and the whole NPM algorithm
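
The first-peak test used by ComputeCut can be sketched in Python as below. The helper name `first_peak`, the dictionary representation of the cutvalues, and the rounding of the window half-width n/2k (integer division, with a floor of 1) are assumptions made for illustration.

```python
def first_peak(values, n, k):
    """Return the first cut position whose cutvalue is a peak, or None.

    values: dict mapping cut position i to cutvalue(i).
    A position is a peak when its cutvalue is bigger than every other
    cutvalue within eps positions on either side, with eps = n/2k.
    """
    eps = max(1, n // (2 * k))
    for i in sorted(values):
        window = [values[j] for j in range(i - eps, i + eps + 1)
                  if j != i and j in values]
        if all(values[i] > v for v in window):
            return i
    return None


# Hypothetical cutvalues for a six-node sequence with k = 2 desired clusters.
cutvalues = {2: 2.0, 3: 4.5, 4: 4 / 3}
print(first_peak(cutvalues, n=6, k=2))
```

A full NPM loop would call such a test repeatedly on the residual sequence until the granularity and coupling constraints are met; that outer loop is not reproduced here.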



4.1 Parameter Effectiveness 

Fig. 4 (a) shows the computed values of the two algorithm parameters, alpha1 and
alpha2, for DS1. According to the similarity definition of DS1, the words with the

Table 1. The properties of the five testing data sets

Testing Data   Number of Nodes   Number of Edges   Threshold of Edit Distance   Result shown in
DS1                       3419            146284   N/A                          Fig. 4
DS2                        472               428   1                            Fig. 5
DS3                        300               953   2                            Fig. 6 (a)
DS4                       1000              4908   2                            Fig. 6 (b)
DS5                       5000            312834   2                            Fig. 6 (c)




Fig. 4. (a) Values of alpha1 and alpha2 for DS1 in the first partitioning step (b) Values
of beta for DS1 in the first partitioning step



same length form a complete graph; the whole graph contains 6 complete subgraphs
that are isolated from each other. The boundaries of the 6 subgraphs/clusters are
clearly shown where the values of alpha1 drop dramatically.

Fig. 4 (b) shows the computed values of beta. We can see that the values of beta
reach a minimum at the boundaries of clusters while the values of alpha1 reach a
maximum at the boundaries. According to the definition of cutvalue, the value of
alpha1(i)/beta(i) would be significantly bigger when i is the index of a boundary
point, i.e., the cutting point can be correctly found.




Fig. 5. The values of beta (a) before applying BEA and (b) after applying BEA for DS2. 





Fig. 6. The constraint satisfaction process for (a) DS3 (b) DS4 and (c) DS5 



Fig. 5 shows the difference in the values of beta before and after applying the
BEA. Fig. 5(a) shows that before applying the BEA the couplings between the data
points are not very different and we cannot find a cut point to partition the node
sequence. After applying the BEA to the same graph, we find that the difference
between the couplings of different clusters is sharpened while the distribution of
the coupling values inside each cluster becomes smoother. As clearly shown in
Fig. 5 (b), the graph contains six clusters.



4.2 Constraint Satisfaction 

This part of the experiments evaluates whether the proposed partitioning algorithm can
produce satisfactory results according to different user-input constraints. Each testing
graph is evaluated against 8 different clustering requirements, and each requirement
contains two kinds of constraints, as defined in Section 3. Each of Fig. 6 (a), Fig. 6
(b), and Fig. 6 (c) contains two sub-figures, corresponding to the two kinds of
constraints respectively. The 8 requirements are designed from loose to strict, i.e., the
first requirement is the easiest to satisfy and the last is the hardest. If the
produced granularity is between the minimum and maximum requirement, the
corresponding clustering result is satisfactory on granularity; if the coupling bound of
the produced result is below the upper bound constraint, the corresponding clustering
result is satisfactory on coupling. If both kinds of constraints are satisfied, the
clustering result is satisfactory.

4.3 Result Visualization 

Another way to evaluate our approach is to visualize the results. Fig. 7(a) shows the
original graph with 320 nodes. The graph nodes belonging to the same cluster are in the
same gray level. We apply a popular force-directed graph clustering algorithm,
Kamada and Kawai’s method [7], to the graph; its result is shown in Fig. 7 (b),
while our results are shown in Fig. 7 (c) and (d). Although our approach does not
compete with Kamada and Kawai’s method on the quality of graph layout, it separates
clusters clearly. In Fig. 7(b) many graph nodes belonging to different clusters are
mixed up, while our method can discover all clusters correctly. Fig. 7 (c) and (d) show
that our method can produce different numbers of clusters according to the user input.
Apart from the advantage in effectiveness, our method is faster than Kamada and
Kawai’s method. It takes only several minutes for our program to generate the results
in Fig. 7 (c) and (d), while Kamada and Kawai’s method needs more than 1 hour to
reach a stable layout.

Fig. 7. (a) The original graph before clustering (b) the same graph after applying Kamada and
Kawai’s method. The same graph after applying our algorithm with (c) 16 clusters (d) 8 clusters


5 Conclusions

This paper has presented a novel graph partitioning method for constrained data
clustering. A new constraint, the upper bound of the similarity between two clusters,
is introduced and solved with the proposed graph partitioning method. The method
consists of two steps: sequencing the given set of graph nodes, and then partitioning
the node sequence into final clusters. This method has at least two advantages. First,
the objective function not only follows the theoretical min-max principle but also
reflects certain practical requirements. Second, new constraints from practical
clustering problems are introduced and solved so that the clustering results can be
tailored to more application needs, while unconstrained methods can only control the
number of produced clusters. Our experimental studies have visualized the clustering
results intuitively and demonstrated that the combination of graph partitioning and
constrained data clustering is successful. Future work will explore whether it is
possible to locate the feasible range of the constraints for a given clustering task so
that the user can be guided on constraint input.
Abstract. We propose a general framework for the process mining problem which
encompasses the assumption of workflow schemas with local constraints only and is
also applicable to more expressive specification languages, independently of the
particular syntax adopted. In particular, we provide an effective technique for process
mining based on the rather unexplored concept of clustering workflow executions,
in which clusters of executions sharing the same structure and the same unexpected
behavior (w.r.t. the local properties) are seen as witnesses of the existence of global
constraints.

A framework for assessing the similarity between the original model
and the discovered one is proposed, together with some experimental results supporting
the validity of our approach.



1 Introduction 

Even though workflow management systems (WfMS) are increasingly used in
enterprises, their actual impact in automating complex processes is still limited by the
difficulties encountered in the design phase. In fact, processes have complex and often
unexpected dynamics, whose modelling requires long and expensive analysis, which may
eventually prove unviable from an economic viewpoint.

Recent research has addressed this problem with strategies, called process
mining techniques, that use the information collected during the enactment of a process
not yet supported by a WfMS, such as the transaction logs of ERP systems like SAP,
in order to derive a model explaining the recorded events. The output of these
techniques, i.e., the “mined” synthetic model, can then be profitably used to (re)design a
detailed workflow schema, capable of supporting automatic enactments of the process.

Several approaches for process mining have been proposed in the literature (see, 
e.g., [1,16,4,12]), that aim at reconstructing the structure of the process, by exploiting 
graphical models based on the notion of control flow graph. This is an intuitive way of 
specifying a process through a directed graph, where nodes correspond to the activities 
in the process and edges represent the potential flow of work, i.e., the relationships of 
precedence among the activities. 

However, despite its intuitiveness, the control flow graph completely lacks the ability to
formalize complex global constraints on the executions, which often occur when
modelling real scenarios, since it can prescribe only local constraints in terms of
relationships of precedence.

In this paper, we extend previous approaches to process mining, by proposing an 
algorithm which is able to discover not only the control flow of a given process, but also 
some interesting global constraints, in order to provide the designer with a refined view of
the process. The main contributions are as follows.

In Section 2, we formalize the process model discovery problem, in a context in which
the target workflow schema may be enriched with some global constraints, denoted by
$C_G$. In order to decouple the approach from the particular syntax adopted for expressing
$C_G$, we exploit the observation that each global constraint leads to instances with a
specific structure (in short, a pattern); then, a workflow schema $\mathcal{WS}^\vee$, accounting for global
constraints, is the union of several schemas $\mathcal{WS}^1, \ldots, \mathcal{WS}^m$ (without global constraints),
each one supporting the execution of one pattern only.

Different patterns of executions (and, hence, $\mathcal{WS}^\vee$) are identified by means of an
algorithm for clustering workflow traces, presented in Section 3, which is based on the
projection of the traces on a suitable set of properly defined features. The approach is
similar in spirit to the proposals for clustering sequences using frequent itemsets,
but technically more complex, since it derives a hierarchical clustering. The theoretical
properties of the algorithm are investigated as well.

In Section 3.1, we propose a level-wise algorithm for the identification of the set of
features $\mathcal{F}$ for the clustering, and we study the problem of selecting the most ‘representative’
subset of $\mathcal{F}$, by showing its intrinsic difficulty. Therefore, we propose a greedy
heuristic for quickly computing a set of features approximating the optimal solution.

Finally, we experiment with an implementation of the proposed technique, showing its
scalability. A framework for assessing the similarity between the original
model and the discovered one is proposed in Section 4, thus providing a quantitative
way of testing the validity of the approach.



2 Formal Framework

In this section we formalize the mining problem addressed in the paper, which can be 
roughly described as the problem of (re)constructing a workflow model of an unknown 
process P, on the basis of log data related to some executions of the process. 

The control flow graph of a process $P$ is a tuple $\mathcal{CF}(P) = \langle A, E, a_0, F \rangle$, where $A$ is
a finite set of activities, $E \subseteq (A - F) \times (A - \{a_0\})$ is a relation of precedences among
activities, $a_0 \in A$ is the starting activity, and $F \subseteq A$ is the set of final activities.

Any connected subgraph $I = \langle A_I, E_I \rangle$ of the control flow graph, such that $a_0 \in A_I$
and $A_I \cap F \neq \emptyset$, is a potential instance of $P$. In order to model restrictions on the
possible instances, the description of the process is often enriched with some additional 
local or global constraints, requiring, e.g., that an activity must (or may not) directly (or 
indirectly) follow the execution of a number of other activities. 
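
As an illustration of the definition of potential instance, the following minimal Python sketch checks the stated conditions on a candidate subgraph. It is not part of the original method: the function name `is_potential_instance`, the encoding of graphs as sets of activity names and edge pairs, and the reading of "connected" as connectivity that ignores edge direction are assumptions made for this example only.

```python
from collections import deque


def is_potential_instance(nodes, edges, cf_edges, start, finals):
    """Check whether (nodes, edges) is a potential instance of a control flow graph.

    Hypothetical helper: graphs are encoded as a set of activity names and a
    set of (source, target) pairs; connectivity ignores edge direction.
    """
    nodes, edges = set(nodes), set(edges)
    # The candidate must be a subgraph of the control flow graph.
    if not edges <= set(cf_edges):
        return False
    if any(a not in nodes or b not in nodes for a, b in edges):
        return False
    # It must contain the starting activity and at least one final activity.
    if start not in nodes or not (nodes & set(finals)):
        return False
    # Breadth-first visit from the starting activity over undirected edges.
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for n in neighbours[queue.popleft()] - seen:
            seen.add(n)
            queue.append(n)
    return seen == nodes


# Toy control flow graph: a -> {b, c} -> d, with starting activity a and final activity d.
CF_EDGES = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
```

On this toy graph, the path a, b, d is accepted, whereas a subgraph that misses the final activity or leaves an activity disconnected is rejected. Local and global constraints are deliberately not modelled here.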

For instance, local constraints are that an and-join activity can be executed only after 
all its predecessors are completed, and that an or-join activity can be executed as soon 
as one of its predecessors is completed. Other examples are that an and-split activity
activates all of its successor activities, while a xor-split activates exactly one of its 
outgoing arcs. 

Global constraints are, instead, richer in nature and their representation strongly de- 
pends on the particular application domain of the modelled process. Thus, they are often 
expressed using other complex formalisms, mainly based on a suitable logic with an 
associated clear semantics. 

Let $P$ be a process. A workflow schema for $P$, denoted by $\mathcal{WS}(P)$, is a tuple
$\langle \mathcal{CF}(P), C_L(P), C_G(P) \rangle$, where $\mathcal{CF}(P)$ is the control flow graph of $P$, and $C_L(P)$
and $C_G(P)$ are sets of local and global constraints, respectively. Given a subgraph $I$ of
$\mathcal{CF}(P)$ and a constraint $c$ in $C_L(P) \cup C_G(P)$, we write $I \models c$ whenever $I$ satisfies $c$ in
the associated semantics. Moreover, if $I \models c$ for all $c$ in $C_L(P) \cup C_G(P)$, $I$ is called an
instance of $\mathcal{WS}(P)$, denoted by $I \models \mathcal{WS}(P)$. When the process $P$ is clear from the
context, a workflow schema will be simply denoted by $\mathcal{WS} = \langle \mathcal{CF}, C_L, C_G \rangle$.

2.1 The Process Model Discovery Problem 

Let $A_P$ be the set of task identifiers for the process $P$. We assume the actual workflow
schema $\mathcal{WS}(P)$ for $P$ to be unknown, and we consider the problem of properly identifying
it in the set of all the possible workflow schemas having $A_P$ as set of nodes. In
order to formalize this problem we need some preliminary definitions and notation.

A workflow trace $s$ over $A_P$ is a string in $A_P^*$, representing a task sequence. Given a
trace $s$, we denote by $s[i]$ the $i$-th task in the corresponding sequence, and by $length(s)$
the length of $s$. The set of all the tasks in $s$ is denoted by $tasks(s) = \bigcup_{1 \le i \le length(s)} \{s[i]\}$.
Finally, a workflow log for $P$, denoted by $\mathcal{L}_P$, is a bag of traces over $A_P$: $\mathcal{L}_P = [\, s \mid s \in
A_P^* \,]$, and it is the only input from which to infer the schema $\mathcal{WS}(P)$.

In order to substantiate the problem of mining $\mathcal{WS}(P)$, one must specify which
language is to be adopted for expressing the global constraints in $C_G$. In order to devise
a general approach, it is convenient to find an alternative (syntax-independent) way for 
evidencing global constraints. The solution adopted in this paper is to replace a unique 
target schema WS{P) with a variety of alternative schemata having no global constraints 
but directly modelling the various execution patterns prescribed by global constraints. 
The basic idea is to first derive from the trace logs an initial workflow schema whose 
global constraints are left unexpressed and, then, to stepwise refine it into a number of 
specific schemas, each one modelling a class of traces having the same characteristics 
w.r.t. global constraints. 

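Definition 1. Let $P$ be a process. A disjunctive workflow schema for $P$, denoted by
$\mathcal{WS}^\vee(P)$, is a set $\{\mathcal{WS}^1, \ldots, \mathcal{WS}^m\}$ of workflow schemata for $P$, with $\mathcal{WS}^j =
\langle \mathcal{CF}^j, C_L^j, \emptyset \rangle$, for $1 \le j \le m$. The size of $\mathcal{WS}^\vee(P)$, denoted by $|\mathcal{WS}^\vee(P)|$, is the
number of workflow schemata it contains. An instance of any $\mathcal{WS}^j$ is also an instance
of $\mathcal{WS}^\vee$, denoted by $I \models \mathcal{WS}^\vee$. □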

Given $\mathcal{L}_P$, we aim at discovering a disjunctive schema $\mathcal{WS}^\vee$ as “close” as possible
to the actual unknown schema $\mathcal{WS}(P)$ that generated the logs. This intuition can
be formalized by accounting for two criteria, namely completeness and soundness, constraining
the discovered workflow to admit exactly the traces of the log. Obviously, we
preliminarily need some mechanism for deciding whether a given trace in $\mathcal{L}_P$ can
actually be derived from a real instantiation of a workflow $\mathcal{WS}^\vee$. Ideally, we might exploit
the following definition.

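Definition 2. Let $s$ be a trace in $\mathcal{L}_P$, $\mathcal{WS}^\vee$ be a disjunctive workflow schema, and
$I = \langle A_I, E_I \rangle$ be an instance of it. Then, $s$ is compliant with $\mathcal{WS}^\vee$ through $I$, denoted
by $s \models^I \mathcal{WS}^\vee$, if $s$ is a topological sort of $I$, i.e., $s$ is an ordering of the activities in $A_I$
s.t. for each $(a, b) \in E_I$, $i < j$ where $s[i] = a$ and $s[j] = b$. Moreover, $s$ is simply said
to be compliant with $\mathcal{WS}^\vee$, denoted by $s \models \mathcal{WS}^\vee$, if there exists $I$ with $s \models^I \mathcal{WS}^\vee$. □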
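
The compliance test of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an instance is encoded as a pair (set of activities, set of precedence pairs), a trace is any sequence of activity names, and the function names `is_compliant_through` and `is_compliant` are hypothetical.

```python
def is_compliant_through(trace, nodes, edges):
    """True if `trace` is a topological sort of the instance (nodes, edges)."""
    # The trace must list every activity of the instance exactly once.
    if len(trace) != len(set(trace)) or set(trace) != set(nodes):
        return False
    position = {task: i for i, task in enumerate(trace)}
    # Every precedence (a, b) of the instance must be respected by the ordering.
    return all(position[a] < position[b] for a, b in edges)


def is_compliant(trace, instances):
    """True if `trace` is compliant through at least one of the given instances."""
    return any(is_compliant_through(trace, n, e) for n, e in instances)


# Toy instance: a precedes b and c, which both precede d.
NODES = {"a", "b", "c", "d"}
EDGES = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
```

With this instance, the traces `abcd` and `acbd` are both compliant, since b and c are not ordered with respect to each other, whereas `abdc` violates the precedence of c over d.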



We are now ready to introduce, for a disjunctive workflow schema and for a trace log, 
the notions of soundness (i.e., every instance must be witnessed by some trace in the 
log) and of completeness (all traces are compliant with some instance). As the schema 
is not given but discovered from the analysis of the trace log, the two notions are given 
with a certain amount of uncertainty. 



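Definition 3. Let $\mathcal{WS}^\vee$ be a disjunctive workflow model, and $\mathcal{L}_P$ be a log for process
$P$. We define:

- $soundness(\mathcal{WS}^\vee, \mathcal{L}_P) = \frac{|\{I \mid I \models \mathcal{WS}^\vee \wedge \nexists s \in \mathcal{L}_P \text{ s.t. } s \models^I \mathcal{WS}^\vee\}|}{|\{I \mid I \models \mathcal{WS}^\vee\}|}$, i.e., the percentage
of instances having no corresponding traces in the log;

- $completeness(\mathcal{WS}^\vee, \mathcal{L}_P) = \frac{|\{s \mid s \in \mathcal{L}_P \wedge s \models \mathcal{WS}^\vee\}|}{|\{s \mid s \in \mathcal{L}_P\}|}$, i.e., the percentage of traces in the log that
are compliant with some instance of the schema.

Given two real numbers $\alpha$ and $\sigma$ between 0 and 1 (typically $\alpha$ is small whereas $\sigma$ is
close to 1), we say that $\mathcal{WS}^\vee$ is

- $\alpha$-sound w.r.t. $\mathcal{L}_P$, if $soundness(\mathcal{WS}^\vee, \mathcal{L}_P) \le \alpha$, i.e., the smaller the sounder;

- $\sigma$-complete w.r.t. $\mathcal{L}_P$, if $completeness(\mathcal{WS}^\vee, \mathcal{L}_P) \ge \sigma$, i.e., the larger the more
complete. □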
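
To make the two measures concrete, the sketch below computes them for a schema whose instances are listed explicitly. This is a simplifying assumption for illustration: in general the instances of a schema are too many to enumerate, and the names `soundness`, `completeness` and `_topological` are hypothetical helpers rather than the authors' code.

```python
def _topological(trace, nodes, edges):
    """True if `trace` is a topological sort of the instance (nodes, edges)."""
    pos = {task: i for i, task in enumerate(trace)}
    return (len(pos) == len(trace) and set(pos) == set(nodes)
            and all(pos[a] < pos[b] for a, b in edges))


def soundness(instances, log):
    """Fraction of instances with no corresponding trace in the log (smaller is sounder)."""
    unwitnessed = sum(
        1 for nodes, edges in instances
        if not any(_topological(s, nodes, edges) for s in log)
    )
    return unwitnessed / len(instances)


def completeness(instances, log):
    """Fraction of log traces compliant with some instance (larger is more complete)."""
    compliant = sum(
        1 for s in log
        if any(_topological(s, nodes, edges) for nodes, edges in instances)
    )
    return compliant / len(log)


# Two instances (a-b-d and a-c-d) and a log, given as a bag (list) of traces.
INSTANCES = [({"a", "b", "d"}, {("a", "b"), ("b", "d")}),
             ({"a", "c", "d"}, {("a", "c"), ("c", "d")})]
LOG = ["abd", "abd", "adb", "abd"]
```

In this toy case the instance a-c-d is never witnessed, so the soundness value is 0.5, and three of the four traces are compliant, so the completeness value is 0.75.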

We want to discover a disjunctive schema $\mathcal{WS}^\vee$ for a given process $P$ which is
$\alpha$-sound and $\sigma$-complete, for some given $\alpha$ and $\sigma$. However, it is easy to see that a
trivial schema satisfying the above conditions always exists, consisting of the union of
exactly one workflow (without global constraints) modelling each of the instances in
$\mathcal{L}_P$. Such a model would not be a synthetic view of the process $P$, since its size
would be $|\mathcal{WS}^\vee| = |\mathcal{L}_P|$, where $|\mathcal{L}_P| = |\{s \mid s \in \mathcal{L}_P\}|$. We therefore introduce a bound on
the size of $\mathcal{WS}^\vee$.



Definition 4. (Minimal Process Discovery) Let $\mathcal{L}_P$ be a workflow log for the process $P$.
Given a real number $\sigma$ and a natural number $m$, the Minimal Process Discovery problem,
denoted by MPD($P, \sigma, m$), consists in finding a $\sigma$-complete disjunctive workflow schema
$\mathcal{WS}^\vee$, such that $|\mathcal{WS}^\vee| \le m$ and $soundness(\mathcal{WS}^\vee, \mathcal{L}_P)$ is minimal. □

The problem is obviously solvable, as one may sacrifice enough soundness
to get a result. But, as shown next, the problem is intractable. W.l.o.g., let us assume
that the values representing soundness are suitably discretized as positive integers so
that we can represent MPD as an NP optimization problem.

Theorem 1. MPD($P, \sigma, m$) is an NP-complete optimization problem whose set of feasible
solutions is not empty.

Armed with the above result, we turn to the problem PD($P, \sigma, m$) of greedily finding a
suitable approximation, that is, a $\sigma$-complete workflow schema $\mathcal{WS}^\vee$, with $|\mathcal{WS}^\vee| \le m$,
which is as sound as possible. In the rest of the paper, we propose an efficient technique for
solving this problem.



3 Clustering Workflow Traces 



In order to mine the underlying workflow schema of the process $P$ (problem PD($P, \sigma, m$)),
we exploit the idea of iteratively and incrementally refining a schema, by mining some
global constraints which are then used for discriminating the possible executions, starting
with a preliminary disjunctive model $\mathcal{WS}^\vee$, which only accounts for the dependencies
among the activities in $P$.

The algorithm ProcessDiscover, shown in Figure 1, which computes $\mathcal{WS}^\vee$ through
a hierarchical clustering, first mines a control flow $\mathcal{CF}_\sigma$ according to the threshold
$\sigma$* through the procedure minePrecedences, which mainly exploits techniques already
presented in the literature (see, e.g., [1,18]) and, therefore, is not illustrated in more
detail. Each workflow schema $\mathcal{WS}_i^j$, eventually inserted in $\mathcal{WS}^\vee$, is identified by the
number $i$ of refinements needed, and an index $j$ for distinguishing the schemas at the
same refinement level. Moreover, we denote by $\mathcal{L}(\mathcal{WS}_i^j)$ the set of traces in the cluster
defined by $\mathcal{WS}_i^j$. Notice that, preliminarily, $\mathcal{WS}_0^1$, containing all the logs in $\mathcal{L}_P$, is inserted
in $\mathcal{WS}^\vee$, and in Step 3 we refine the model by mining some local constraints, too.

The algorithm is also guided by a greedy heuristic that at each step selects a schema
$\mathcal{WS}_i^j \in \mathcal{WS}^\vee$ to be refined with the function refineWorkflow, by preferring the
schema which can be most profitably refined. In practice, we refine the least sound
schema among the ones already discovered; however, some experiments have also been
conducted by refining the schema $\mathcal{WS}_i^j$ with the maximum value of $|\mathcal{L}(\mathcal{WS}_i^j)|$.

In order to reuse well-known clustering methods, and specifically in our implementation
the k-means algorithm, the procedure refineWorkflow translates the logs $\mathcal{L}(\mathcal{WS}_i^j)$ to
relational data with the procedures identifyRelevantFeatures and project, which will be
discussed in the next section. Then, if more than one feature is identified, it computes
the clusters $\mathcal{WS}_{i+1}^{j+1}, \ldots, \mathcal{WS}_{i+1}^{j+k}$, where $j$ is the maximum index of the schemas already
inserted in $\mathcal{WS}^\vee$ at the level $i + 1$, by applying the k-means algorithm on the traces
in $\mathcal{L}(\mathcal{WS}_i^j)$, and inserts them into the disjunctive schema $\mathcal{WS}^\vee$. Finally, for each
schema inserted the procedure mineLocalConstraints is applied, in order to identify
local constraints as well.

The algorithm ProcessDiscover converges in at most m steps (see Step 4), and 
exploits the following interesting property of the procedure refineWorkflow. We observe 
that at each step of workflow refinement the value of soundness decreases, thus the 
algorithm gets closer to the optimal solution. 

* Roughly, the edges in $\mathcal{CF}_\sigma$ represent a minimal set of precedences with at least a given support

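Input: Problem PD($P, \sigma, m$), natural number maxFeatures.
Output: A process model.
Method: Perform the following steps:
1 $\mathcal{CF}_\sigma(\mathcal{WS}_0^1)$ := minePrecedences($\mathcal{L}_P$); //See Section 3.1
2 let $\mathcal{WS}_0^1$ be a schema, with $\mathcal{L}(\mathcal{WS}_0^1) = \mathcal{L}_P$;
3 mineLocalConstraints($\mathcal{WS}_0^1$); //See Section 3.1
3 $\mathcal{WS}^\vee$ := $\{\mathcal{WS}_0^1\}$; //Start clustering with the dependency graph only
4 while $|\mathcal{WS}^\vee| < m$ do
5    $\mathcal{WS}_i^j$ := leastSound($\mathcal{WS}^\vee$);
6    $\mathcal{WS}^\vee$ := $\mathcal{WS}^\vee - \{\mathcal{WS}_i^j\}$;
7    refineWorkflow($i, j$);
8 end while
9 return $\mathcal{WS}^\vee$;

Procedure refineWorkflow($i$: step, $j$: schema);
1 $\mathcal{F}$ := identifyRelevantFeatures($\mathcal{L}(\mathcal{WS}_i^j)$, $\sigma$, maxFeatures, $\mathcal{CF}_\sigma$); //See Section 3.1
2 $\mathcal{R}(\mathcal{WS}_i^j)$ := project($\mathcal{L}(\mathcal{WS}_i^j)$, $\mathcal{F}$); //See Section 3.1
3 $k$ := $|\mathcal{F}|$;
4 if $k > 1$ then
5    $j$ := max$\{j \mid \mathcal{WS}_{i+1}^j \in \mathcal{WS}^\vee\}$;
6    $\langle \mathcal{WS}_{i+1}^{j+1}, \ldots, \mathcal{WS}_{i+1}^{j+k} \rangle$ := k-means($\mathcal{R}(\mathcal{WS}_i^j)$);
7    for each $\mathcal{WS}_{i+1}^{h}$ do
8       $\mathcal{WS}^\vee$ := $\mathcal{WS}^\vee \cup \{\mathcal{WS}_{i+1}^{h}\}$;
9       $\mathcal{CF}_\sigma(\mathcal{WS}_{i+1}^{h})$ := minePrecedences($\mathcal{L}(\mathcal{WS}_{i+1}^{h})$);
10      mineLocalConstraints($\mathcal{WS}_{i+1}^{h}$);
11   end for
12 else //Leaf of the tree
13   $\mathcal{WS}^\vee$ := $\mathcal{WS}^\vee \cup \{\mathcal{WS}_i^j\}$; //See Theorem 2.2
14 end if;

Fig. 1. Algorithm ProcessDiscover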



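Theorem 2. Given a disjunctive schema $\mathcal{WS}^\vee$ with $\mathcal{WS}_i^j \in \mathcal{WS}^\vee$, the disjunctive
workflow schema $\mathcal{WS}^{\vee\prime}$, obtained by refining $\mathcal{WS}^\vee - \{\mathcal{WS}_i^j\}$ with the procedure
refineWorkflow($i, j$), is such that $soundness(\mathcal{WS}^{\vee\prime}) \le soundness(\mathcal{WS}^\vee)$.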

A main point of the algorithm is fixing the number k of new schemata to be added 
at each refinement step. The range of k goes from a minimum of 2, which will require 
several steps for the computation, to an unbounded value, which will return the result 
in only one step. One could then expect that the latter case is the most efficient. This is
not necessarily true: the clustering algorithm could run slower with a larger number of
classes, thus losing the advantage of a smaller number of iterations. In contrast, there
is an important point in favor of a small value for k: the representation of the various
schemata can be optimized by preserving the tree structure and storing for each node only
the differences w.r.t. the schema of the parent node. The tree representation is relevant
not only because of the space reduction but also because it gives more insight into the
properties of the modelled workflow instances and provides an intuitive and expressive
description of global constraints. 



3.1 Dealing with Relevant Features 

The crucial point of the algorithm for clustering workflow traces lies in the formalization
of the procedures identifyRelevantFeatures and project. Roughly, the former identifies a
set $\mathcal{F}$ of relevant features [10,11,14], whereas the latter projects the traces into a vector
space whose components are, in fact, these features.

Some works addressing the problem of clustering complex data considered the most 
frequent common structures (see e.g. [2,3,7]), also called frequent patterns, to be the 
relevant features for the clustering. Since we are interested in features that witness some 
kind of global constraints, we instead exploit the more involved notion of unexpected 
(w.r.t. the local properties) frequent rules. 

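Let $\mathcal{L}$ be a set of traces, $\mathcal{CF}_\sigma$ be a mined control flow, for threshold $\sigma$, and $E_\sigma$ be the
edge set of $\mathcal{CF}_\sigma$. Then a sequence $[a_1 \ldots a_h]$ of tasks is $\sigma$-frequent in $\mathcal{L}$ if
$|\{s \in \mathcal{L} \mid a_1 = s[i_1], \ldots, a_h = s[i_h] \wedge i_1 < \ldots < i_h\}| / |\mathcal{L}| \ge \sigma$. We say that $[a_1 \ldots a_h]$ $\sigma$-precedes $a$ in $\mathcal{L}$,
denoted by $[a_1 \ldots a_h] \rightarrow_\sigma a$, if both $[a_1 \ldots a_h]$ and $[a_1 \ldots a_h a]$ are $\sigma$-frequent in $\mathcal{L}$.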
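
The notions of σ-frequent sequence and σ-precedence translate directly into a subsequence test over the log. The following Python sketch is illustrative only; the names `occurs_in`, `is_frequent` and `precedes`, and the representation of the log as a list of traces, are assumptions of this example.

```python
def occurs_in(sequence, trace):
    """True if `sequence` is a (possibly non-contiguous) subsequence of `trace`."""
    it = iter(trace)
    # `task in it` advances the iterator, so the tasks must appear in order.
    return all(task in it for task in sequence)


def is_frequent(sequence, log, sigma):
    """True if the fraction of traces containing `sequence` is at least sigma."""
    support = sum(occurs_in(sequence, s) for s in log) / len(log)
    return support >= sigma


def precedes(sequence, a, log, sigma):
    """True if `sequence` sigma-precedes task `a` in the log."""
    extended = list(sequence) + [a]
    return is_frequent(sequence, log, sigma) and is_frequent(extended, log, sigma)


LOG = ["abcd", "acbd", "abd", "acd"]
```

In this log the sequence a, b occurs in three traces out of four, so it is σ-frequent for σ = 0.75, and it also σ-precedes d for the same threshold; the sequence b, c occurs only once and is not σ-frequent for σ = 0.5.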

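Definition 5 (Discriminant Rules). A discriminant rule (feature) $\phi$ is an expression
of the form $[a_1 \ldots a_h] \not\rightarrow_\sigma a$, s.t. (i) $[a_1 \ldots a_h]$ is $\sigma$-frequent in $\mathcal{L}$, (ii) $(a_h, a) \in E_\sigma$,
and (iii) $[a_1 \ldots a_h] \rightarrow_\sigma a$ does not hold. Moreover, $\phi$ is minimal if (iv) there is no $b$, s.t.
$[a_1 \ldots a_h] \not\rightarrow_\sigma b$ and $[b] \rightarrow_\sigma a$, and (v) there is no $j$, s.t. $j > 1$ and $[a_j \ldots a_h] \not\rightarrow_\sigma a$. □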

The identification of discriminant rules can be carried out by means of the level-wise
algorithm shown in Figure 2. At each step $k$ of the computation, we store in $L_k$ all the
$\sigma$-frequent sequences whose size is $k$. Specifically, in Steps 5-9, the set of potential
sequences $M$ to be included in $L_{k+1}$ is obtained by combining those in $L_k$ with the
relationships of precedence in $L_2$; notice that Step 7 prevents the computation of
non-minimal unexpected rules. Then, only the $\sigma$-frequent patterns in $M$ are included in $L_{k+1}$
(Step 11), while all the others determine unexpected rules (Step 12). The process is
repeated until no other frequent traces are found. The correctness of the algorithm can
be easily proven.
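
The level-wise computation described above can be sketched in Python as follows. This is an illustrative reading of the main loop of Figure 2, not the authors' implementation: rules are encoded as pairs (body, head), the function names are hypothetical, the control-flow edges are taken as the initial level without re-checking their frequency, and σ is assumed to be strictly positive so that the loop terminates.

```python
def occurs_in(sequence, trace):
    """True if `sequence` is a (possibly non-contiguous) subsequence of `trace`."""
    it = iter(trace)
    return all(task in it for task in sequence)


def frequent(sequence, log, sigma):
    return sum(occurs_in(sequence, s) for s in log) / len(log) >= sigma


def discriminant_rules(log, edges, sigma):
    """Return (frequent sequences, rules); a rule is a pair (body, head)."""
    level = {(a, b) for a, b in edges}        # L_2: the control-flow edges
    all_frequent, rules = set(level), set()
    while level:
        candidates = set()
        for seq in level:
            for a, b in edges:
                # Extend seq along an edge leaving its last task, unless the
                # suffix of seq already yields a rule with the same head.
                if a == seq[-1] and (seq[1:], b) not in rules:
                    candidates.add(seq + (b,))
        level = set()
        for cand in candidates:
            if frequent(cand, log, sigma):
                level.add(cand)               # kept for the next level
            else:
                rules.add((cand[:-1], cand[-1]))  # unexpected: body does not lead to head
        all_frequent |= level
    return all_frequent, rules


# Toy process: after b the execution always ends with e, after c always with f.
EDGES = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e"), ("d", "f")}
LOG = ["abde", "abde", "acdf", "acdf"]
FREQUENT, RULES = discriminant_rules(LOG, EDGES, 0.5)
```

On this toy log the sketch yields two rules, stating that b, d is never followed by f and that c, d is never followed by e. These are exactly the behaviours that the control flow graph alone cannot express.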

Theorem 3. In the algorithm of Figure 2, before its termination (Step 16):

1. the set $R$ contains exactly all the $\sigma$-frequent sequences of tasks, and

2. the set $\mathcal{F}$ contains exactly all the minimal discriminant rules.

Notice that the algorithm IdentifyRelevantFeatures does not directly output $\mathcal{F}$, but
calls the procedure mostDiscriminantFeatures, whose aim is to find a proper subset of $\mathcal{F}$
which better discriminates the traces in the log.

This intuition can be formalized as follows. Let $\phi$ be a discriminant rule of the form
$[a_1, \ldots, a_h] \not\rightarrow_\sigma b$; then the witness of $\phi$ in $\mathcal{L}$, denoted by $w(\phi, \mathcal{L})$, is the set of logs in
which the pattern $[a_1, \ldots, a_h]$ occurs.

Moreover, given a set of rules $R$, the witness of $R$ in $\mathcal{L}$ is $w(R, \mathcal{L}) = \bigcup_{\phi \in R} w(\phi, \mathcal{L})$. For
a fixed $k$, $R$ is the most discriminant $k$-set of features if $|R| = k$ and there exists
no $R'$ with $|w(R', \mathcal{L})| > |w(R, \mathcal{L})|$ and $|R'| = k$. Notice that the most discriminant
$k$-set of features can be computed in polynomial time by considering all the possible
combinations of $k$ features.

The minimum $k$ for which the most discriminant $k$-set of features, say $S$, covers all
the logs, i.e., $w(S, \mathcal{L}) = \mathcal{L}$, is called the dimension of $\mathcal{L}$, whereas $S$ is the most discriminant
set of features.
Input: A log $\mathcal{L}$, a threshold $\sigma$, the max nr. of features maxFeatures, the control flow graph $\mathcal{CF}_\sigma$, with edge set $E_\sigma$.
Output: A set of minimal discriminant rules.
Method: Perform the following steps:
1 $L_2$ := $\{[ab] \mid (a, b) \in E_\sigma\}$;
2 $k$ := 1, $R$ := $L_2$, $\mathcal{F}$ := $\emptyset$;
3 repeat
4    $M$ := $\emptyset$; $k$ := $k + 1$;
5    forall $[a_i \ldots a_j] \in L_k$ do
6       forall $[a_j b] \in L_2$ do
7          if $[a_{i+1} \ldots a_j] \not\rightarrow_\sigma b$ is not in $\mathcal{F}$ then
8             $M$ := $M \cup \{[a_i \ldots a_j b]\}$;
9    end for
10   forall $p \in M$ of the form $[a_i \ldots a_j b]$ do
11      if $p$ is $\sigma$-frequent in $\mathcal{L}$ then $L_{k+1}$ := $L_{k+1} \cup \{p\}$;
12      else $\mathcal{F}$ := $\mathcal{F} \cup \{[a_i \ldots a_j] \not\rightarrow_\sigma b\}$; //See Theorem 3.2
13   end for
14   $R$ := $R \cup L_{k+1}$; //See Theorem 3.1
15 until $L_{k+1} = \emptyset$;
16 return mostDiscriminantFeatures($\mathcal{F}$);

Procedure mostDiscriminantFeatures($\mathcal{F}$: set of unexpected rules): set of unexpected rules;
1 $S'$ := $\mathcal{L}$; $\mathcal{F}'$ := $\emptyset$;
2 do
3    let $\phi$ = argmax$_{\phi' \in \mathcal{F}} |w(\phi', S')|$;
4    $\mathcal{F}'$ := $\mathcal{F}' \cup \{\phi\}$;
5    $S'$ := $S' - w(\phi, S')$;
6 while ($|S'| / |\mathcal{L}_P| > \sigma$) and ($|\mathcal{F}'| <$ maxFeatures);
7 return $\mathcal{F}'$;

Fig. 2. Algorithm IdentifyRelevantFeatures



Theorem 4. Let $\mathcal{L}$ be a set of traces, $n$ be the size of $\mathcal{L}$ (i.e., the sum of the lengths of
all the traces in $\mathcal{L}$), and $\mathcal{F}$ be a set of features. Then, computing any most discriminant
set of features is NP-hard.

Due to the intrinsic difficulty of the problem, we turn to the computation of a suitable
approximation. In fact, the procedure mostDiscriminantFeatures, actually implemented
in the algorithm for identifying relevant features, computes a set $\mathcal{F}'$ of discriminant
rules, guided by the heuristic of greedily selecting a feature $\phi$ covering the maximum
number of traces, among the ones ($S'$) not covered by previous selections.
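
The greedy selection can be sketched as a standard set-cover heuristic. The Python code below is a simplified illustration under stated assumptions: rules are pairs (body, head), the log is a list of traces, and the loop stops when every trace is covered, when no rule covers a new trace, or when `max_features` rules have been chosen. The stopping test involving σ used in Figure 2 is not reproduced here, and all names are hypothetical.

```python
def occurs_in(sequence, trace):
    """True if `sequence` is a (possibly non-contiguous) subsequence of `trace`."""
    it = iter(trace)
    return all(task in it for task in sequence)


def most_discriminant_features(rules, log, max_features):
    """Greedily pick rules whose bodies cover as many uncovered traces as possible."""
    uncovered = set(range(len(log)))   # S': positions of traces not yet covered
    chosen = []                        # F': rules selected so far
    while uncovered and len(chosen) < max_features:
        best, best_cover = None, set()
        for rule in rules:
            body, _head = rule
            cover = {i for i in uncovered if occurs_in(body, log[i])}
            if len(cover) > len(best_cover):
                best, best_cover = rule, cover
        if best is None:               # no remaining rule covers a new trace
            break
        chosen.append(best)
        uncovered -= best_cover
    return chosen


RULES = [(("b", "d"), "f"), (("c", "d"), "e"), (("a", "b"), "g")]
LOG = ["abde", "abde", "acdf", "abdg"]
```

Here the first rule covers three traces and is selected first; the second rule is then selected because it covers the only remaining trace, while the third rule adds no coverage and is never chosen. Ties are broken by the order in which the rules are listed.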

Finally, the set of relevant features $\mathcal{F}$ can be used for representing each trace $s$ as
a point in the vector space $\mathbb{R}^{|\mathcal{F}|}$. Then, the procedure project maps
traces into $\mathbb{R}^{|\mathcal{F}|}$, where the k-means algorithm can operate. Due to its simplicity, we do not
report the code here.
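
Since the code of project is not reported, the following Python sketch shows only one plausible encoding, in which each coordinate is 1 when the body of the corresponding rule occurs in the trace and 0 otherwise. The binary encoding, the rule representation as (body, head) pairs, and the function names are assumptions of this example; the original procedure may assign different values.

```python
def occurs_in(sequence, trace):
    """True if `sequence` is a (possibly non-contiguous) subsequence of `trace`."""
    it = iter(trace)
    return all(task in it for task in sequence)


def project(log, features):
    """Map each trace to a vector with one 0/1 coordinate per feature (assumed encoding)."""
    return [
        [1.0 if occurs_in(body, trace) else 0.0 for body, _head in features]
        for trace in log
    ]


FEATURES = [(("b", "d"), "f"), (("c", "d"), "e")]
VECTORS = project(["abde", "acdf", "ae"], FEATURES)
```

The resulting list of vectors has one row per trace and can be passed to any k-means implementation.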



4 Experiments 

In this section we study the behavior of the ProcessDiscover algorithm, evaluating
both its effectiveness and its scalability, with the help of a number of tests performed on
synthetic data. The generation of such data can be tuned according to: (i) the size of $\mathcal{WS}$,
(ii) the size of $\mathcal{L}_P$, (iii) the number of global constraints in $C_G$, and (iv) the probability $p$
of choosing any successor edge, in the case of nondeterministic fork activities. The ideas
adopted in generating synthetic data are essentially inspired by [3], and the generator
we exploited is an extension of the one described in [6].

Fig. 3. Fixed Schema. Left: Soundness w.r.t. levels. Right: Scaling w.r.t. number of traces.

Fig. 4. Variable Workflow Schema. Left: Soundness w.r.t. k. Right: Scalability w.r.t. k.



Test Procedure. In order to assess the effectiveness of the technique, we adopted the
following test procedure. Let $\mathcal{WS}(I)$ be a workflow schema for the input process $I$, and
$\mathcal{L}_I$ a log produced with the generator. The quality of any workflow $\mathcal{WS}^\vee(O)$, extracted
by providing the mining algorithm with $\mathcal{L}_I$, is evaluated w.r.t. the original one, $\mathcal{WS}(I)$,
essentially by comparing two random samples of the traces they respectively admit. This
allows us to compute an estimate of the actual soundness and completeness. Moreover,
in order to avoid statistical fluctuations in our results, we generate a number of different
training logs and, whenever relevant, we report for each measure its mean value
together with the associated standard deviation. In the test described here, we focus on
the influence of two major parameters of the method: (i) the branching factor $k$ and
(ii) the maximum number (maxLevels) of levels in the resulting disjunctive schema.
Notice that the case $k = 1$ coincides with traditional algorithms which do not account
for global constraints. All the tests have been conducted on a 1600MHz/256MB Pentium
IV machine running Windows XP Professional.

Results. In a first set of experiments we considered a fixed workflow schema and
some randomly generated instances. Figure 3 (on the left) reports the mean value and
the standard deviation of the soundness of the mined model, for increasing values of $|\mathcal{L}_I|$,
by varying the factor $k$. Notice that for $k = 1$, the algorithm degenerates into computing a
unique schema and, in fact, the soundness is not affected by the parameter maxLevels;
this is the case of any algorithm accounting for local constraints only. Instead, for $k > 1$,
we can even rediscover exactly the underlying schema, after a number of iterations.
These experiments have been conducted on an input log of 1000 instances. Then, on the
right, we report the scaling of the approach as the number of logs in $\mathcal{L}_I$ varies.

In a second set of experiments we also consider variable schemas. In Figure 4 we
report the results for four different workflow schemas. Observe (on the left) that for a
fixed value of $k$, the soundness of the mined schema tends to be low as
the complexity of the schemas, consisting of many nodes and possibly many constraints, increases.
This witnesses the fact that on real processes, traditional approaches (with $k = 1$) perform
poorly, and that for an effective reconstruction of the process it is necessary not
only to fix $k > 1$, but also to deal with several levels of refinement. Obviously, for
complex schemas, the algorithm takes more time, as shown in the same figure on the
right.



5 Conclusions 

In this paper, we have continued the investigation of data mining techniques
for process mining, by providing a method for discovering global constraints, in terms of
the patterns of executions they impose. This is achieved through a hierarchical clustering 
of the logs, in which each trace is seen as a point of a properly identified space of features. 
The precise complexity of the task of constructing this space is provided, as well as a 
practical efficient algorithm for its solution. 
Abstract. Tree structures are used extensively in domains such as
computational biology, pattern recognition, XML databases, computer 
networks, and so on. One important problem in mining databases of 
trees is to find frequently occurring subtrees. However, because of the 
combinatorial explosion, the number of frequent subtrees usually grows 
exponentially with the size of the subtrees. In this paper, we present 
CMTreeMiner, a computationally efficient algorithm that discovers all 
closed and maximal frequent subtrees in a database of rooted unordered 
trees. The algorithm mines both closed and maximal frequent subtrees 
by traversing an enumeration tree that systematically enumerates all 
subtrees, while using an enumeration DAG to prune the branches of 
the enumeration tree that do not correspond to closed or maximal 
frequent subtrees. The enumeration tree and the enumeration DAG 
are defined based on a canonical form for rooted unordered trees, the
depth-first canonical form (DFCF). We compare the performance of our
algorithm with that of PathJoin, a recently published algorithm that 
mines maximal frequent subtrees. 

Keywords: Frequent subtree, closed subtree, maximal subtree, enumer- 
ation tree, rooted unordered tree. 



1 Introduction 

Tree structures are used extensively in domains such as computational biology, 
pattern recognition, XML databases, computer networks, and so on. Trees in real 
applications are often labeled, with labels attached to vertices and edges where 
these labels are not necessarily unique. In this paper, we study one important 
issue in mining databases of labeled rooted unordered trees: finding frequently
occurring subtrees, which has much practical importance [5]. However, as we 
have discovered in our previous study [5], because of the combinatorial explosion, 
the number of frequent subtrees usually grows exponentially with the tree size. 
This is the case especially when the transactions in the database are strongly 
correlated. This phenomenon has two effects: first, there are too many frequent 
subtrees for users to manage and use, and second, an algorithm that discovers all 
frequent subtrees is not able to handle frequent subtrees with large size. To solve 

* This work was supported by NSF under Grant Nos. 0086116, 0085773, and 9817773. 
this problem, in this paper, we propose CMTreeMiner, an efficient algorithm
that, instead of looking for all frequent subtrees, only discovers both closed and 
maximal frequent subtrees in a database of labeled rooted unordered trees. 

Related Work. Recently, there has been growing interest in mining
databases of labeled trees, partly due to the increasing popularity of XML in 
databases. In [9], Zaki presented an algorithm, TreeMiner, to discover all fre- 
quent embedded subtrees, i.e., those subtrees that preserve ancestor-descendant 
relationships, in a forest or a database of rooted ordered trees. The algorithm 
was extended further in [10] to build a structural classifier for XML data. In [2] 
Asai et al. presented an algorithm, FREQT, to discover frequent rooted ordered 
subtrees. For mining rooted unordered subtrees, Asai et al. in [3] and we in [5] 
both proposed algorithms based on enumeration tree growing. Because there 
could be multiple ordered trees corresponding to the same unordered tree, simi- 
lar canonical forms for rooted unordered trees are defined in both studies. In [4] 
we have studied the problem of indexing and mining free trees and developed 
an Apriori-like algorithm, FreeTreeMiner, to mine all frequent free subtrees. In
[8], Xiao et al. presented an algorithm called PathJoin that rewrites a database
into a compact in-memory data structure, FST-Forest, for the mining purpose.
In addition, to the best of our knowledge, PathJoin is the only algorithm that
mines maximal frequent subtrees.

Our Contributions. The main contributions of this paper are: (1) We introduce
the concept of closed frequent subtrees and study its properties and its
relationship with maximal frequent subtrees. (2) In order to mine both closed
and maximal frequent rooted unordered subtrees, we present an algorithm,
CMTreeMiner, which is based on the canonical form and the enumeration tree
that we have introduced in [5]. We develop new pruning techniques based on
an enumeration DAG. (3) Finally, we have implemented our algorithm and have
carried out an experimental study to compare the performance of our algorithm
with that of PathJoin [8].

The rest of the paper is organized as follows. In Section 2, we give the background
concepts. In Section 3, we present our CMTreeMiner algorithm. In Section 4,
we show experimental results. Finally, in Section 5, we give the conclusion.



2 Background 

2.1 Basic Concepts 

In this section, we provide the definitions of the concepts that will be used in
the remainder of the paper. We assume that readers are familiar with
notions such as rooted unordered tree, ancestor/descendant, parent/child, leaf,
etc. In addition, a rooted tree t is a (proper) subtree of another rooted tree s if
the vertices and edges of t are (proper) subsets of those of s. If t is a (proper)
subtree of s, we say s is a (proper) supertree of t. Two labeled rooted unordered
trees t and s are isomorphic to each other if there is a one-to-one mapping from
the vertices of t to the vertices of s that preserves vertex labels, edge labels,
adjacency, and the root. A subtree isomorphism from t to s is an isomorphism
from t to some subtree of s. For convenience, in this paper we call a rooted tree
with k vertices a k-tree.

Let D denote a database where each transaction s ∈ D is a labeled rooted
unordered tree. For a given pattern t, which is a rooted unordered tree, we
say t occurs in a transaction s if there exists at least one subtree of s that is
isomorphic to t. The occurrence δ_t(s) of t in s is the number of distinct subtrees
of s that are isomorphic to t. Let σ_t(s) = 1 if δ_t(s) > 0, and 0 otherwise. We
say s supports pattern t if σ_t(s) is 1, and we define the support of a pattern t
as supp(t) = Σ_{s ∈ D} σ_t(s). A pattern t is called frequent if its support is
greater than or equal to a minimum support (minsup) specified by a user. The
frequent subtree mining problem is to find all frequent subtrees in a given
database.
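
As a small illustration of these definitions, the following sketch computes supp(t) from per-transaction occurrence counts. It assumes the counts δ_t(s) are already available (counting subtree isomorphisms is not shown) and that minsup is given as an absolute number of transactions; the function names and the sample counts are hypothetical.

```python
def support(occurrences):
    """supp(t): number of transactions s with delta_t(s) > 0.

    `occurrences` maps a transaction id to delta_t(s), the number of
    distinct subtrees of s isomorphic to the pattern t (assumed given).
    """
    return sum(1 for delta in occurrences.values() if delta > 0)


def is_frequent(occurrences, minsup):
    """A pattern is frequent if its support is at least minsup."""
    return support(occurrences) >= minsup


# Hypothetical occurrence counts of one pattern in four transactions.
delta_t = {"s1": 3, "s2": 0, "s3": 1, "s4": 2}
print(support(delta_t))         # 3: the three occurrences in s1 count once
print(is_frequent(delta_t, 3))  # True
```

Note that support counts transactions, not occurrences, which is why multiple occurrences inside one transaction contribute only 1.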

One nice property of frequent trees is the a priori property, as given below: 

Property 1. Any subtree of a frequent tree is also frequent and any supertree of 
an infrequent tree is also infrequent. 

We define a frequent tree t to be maximal if none of t’s proper supertrees
is frequent, and closed if none of t’s proper supertrees has the same support
that t has. For a subtree t, we define the blanket of t as the set of subtrees
B_t = {t' | removing a leaf or the root from t' can result in t}. In other words,
the blanket B_t of t is the set of all supertrees of t that have one more vertex
than t. With the definition of blanket, we can define maximal and closed
frequent subtrees in another equivalent way:

Property 2. A frequent subtree t is maximal iff for every t' ∈ B_t, supp(t') <
minsup; a frequent subtree t is closed iff for every t' ∈ B_t, supp(t') < supp(t).
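
Property 2 can be read as a simple test on the supports of the blanket members. The sketch below is only an illustration: it assumes the supports of all t' in the blanket have already been computed, and the function name `classify` is hypothetical.

```python
def classify(supp_t, blanket_supports, minsup):
    """Return (is_closed, is_maximal) for a frequent subtree t.

    `blanket_supports` holds supp(t') for every t' in the blanket B_t
    (assumed to be computed elsewhere).
    """
    if supp_t < minsup:
        raise ValueError("t must be frequent")
    # Closed: no supertree in the blanket has the same support as t.
    is_closed = all(s < supp_t for s in blanket_supports)
    # Maximal: no supertree in the blanket is frequent.
    is_maximal = all(s < minsup for s in blanket_supports)
    return is_closed, is_maximal


print(classify(5, [5, 2], minsup=3))  # (False, False)
print(classify(5, [4, 2], minsup=3))  # (True, False)
print(classify(5, [2, 1], minsup=3))  # (True, True)
```

Since a supertree never has larger support than t, every maximal frequent subtree is also closed, which the last two calls reflect.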

For a subtree t and one of its supertrees t' ∈ B_t, we define the difference
between t' and t (t'\t in short) as the additional vertex of t' that is not in t.
We say t' ∈ B_t and t are occurrence-matched if for each occurrence of t in (a
transaction of) the database, there is at least one corresponding occurrence of
t'; we say t' ∈ B_t and t are support-matched if for each transaction s ∈ D such
that σ_t(s) = 1, we have σ_{t'}(s) = 1. Clearly, if t' and t are
occurrence-matched, then they are also support-matched.

2.2 Properties of Closed and Maximal Frequent Subtrees

The set of all frequent subtrees, the set of closed frequent subtrees and the set 
of maximal frequent subtrees have the following relationship. 

Property 3. For a database D and a given minsup, let T be the set of all frequent
subtrees, C be the set of closed frequent subtrees, and M be the set of maximal
frequent subtrees; then M ⊆ C ⊆ T.

The reason why we want to mine closed and maximal frequent subtrees instead
of all frequent subtrees is that usually there are far fewer closed or
maximal frequent subtrees than frequent subtrees overall
[7]. In addition, by mining only closed and maximal frequent subtrees, we do not
lose much information, because the set of closed frequent subtrees maintains the
same information (including support) as the set of all frequent subtrees, and the
set of maximal frequent subtrees subsumes all frequent subtrees:

Property 4. We can obtain all frequent subtrees from the set of maximal frequent
subtrees because any frequent subtree is a subtree of one (or more) maximal
frequent subtree(s); similarly, we can obtain all frequent subtrees with their
supports from the set of closed frequent subtrees with their supports, because
for a frequent subtree t that is not closed, supp(t) = max{supp(t')}, where t'
ranges over the closed supertrees of t.
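
The second half of Property 4 can be sketched as a lookup over the closed set. The example below is a simplification under stated assumptions: the subtree test is passed in as a function, and the toy data uses chain-shaped trees written as label strings (each vertex has a single child), for which "t is a subtree of s" reduces to "t is a contiguous substring of s". The names `recover_support` and `is_chain_subtree` are hypothetical.

```python
def recover_support(t, closed, is_subtree):
    """Support of t recovered from closed frequent subtrees.

    `closed` maps each closed frequent subtree to its support.
    Returns None when t is a subtree of no closed frequent subtree,
    i.e. when t is not frequent.
    """
    supports = [s for c, s in closed.items() if is_subtree(t, c)]
    return max(supports) if supports else None


def is_chain_subtree(t, s):
    """Subtree test for chain-shaped trees only (toy assumption)."""
    return t in s


# Toy database of chains: "ABC", "ABC", "AB" with minsup = 2.
# Its closed frequent subtrees are AB (support 3) and ABC (support 2).
closed = {"AB": 3, "ABC": 2}
print(recover_support("A", closed, is_chain_subtree))   # 3
print(recover_support("BC", closed, is_chain_subtree))  # 2
print(recover_support("D", closed, is_chain_subtree))   # None
```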

2.3 The Canonical Form for Rooted Labeled Unordered Trees 

From a rooted unordered tree we can derive many rooted ordered trees, as shown 
in Figure 1. From these rooted ordered trees we want to uniquely select one as 
the canonical form to represent the corresponding rooted unordered tree. Notice
that if a labeled tree is rooted, then without loss of generality we can assume
that all edge labels are identical: because each edge connects a vertex with its
parent, we can consider an edge, together with its label, as a part of the child
vertex. So for all running examples in the following discussion, we assume that
all edges in all trees have the same label or, equivalently, are unlabeled, and we
therefore ignore all edge labels.




Fig. 1. Four Rooted Ordered Trees Obtained from the Same Rooted Unordered Tree 

Without loss of generality, we assume that there are two special symbols,
“$” and “#”, which are not in the alphabet of edge labels and vertex labels. In
addition, we assume that (1) there exists a total ordering among edge and vertex
labels, and (2) “#” sorts greater than “$” and both sort greater than any other
symbol in the alphabet of vertex and edge labels. We first define the depth-first
string encoding for a rooted ordered tree through a depth-first traversal and
use “$” to represent a backtrack and “#” to represent the end of the string
encoding. The depth-first string encodings for the four trees in Figure 1
are (a) ABC$$BD$C#, (b) ABC$$BC$D#, (c) ABD$C$$BC#,
and (d) ABC$D$$BC#. With the string encoding, we define the depth-first
canonical string (DFCS) of the rooted unordered tree as the minimal one among
all possible depth-first string encodings, and we define the depth-first canonical
form (DFCF) of a rooted unordered tree as the corresponding rooted ordered
tree that gives the minimal DFCS. In Figure 1, the depth-first string encoding
for tree (d) is the DFCS, and tree (d) is the DFCF for the corresponding labeled
rooted unordered tree. Using a tree isomorphism algorithm given by Aho et al.
[1,4,5], we can construct the DFCF for a rooted unordered tree in O(ck log k)
time, where k is the number of vertices the tree has and c is the maximal degree
of the vertices in the tree.
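
The encoding and the canonical form can be illustrated with a short sketch. It is not the Aho et al. algorithm cited above, only a plain recursive sort that yields the same canonical form on small examples. It assumes single-character vertex labels other than “$” and “#”, and represents a tree as a (label, children) pair; all function names are hypothetical.

```python
def _rank(ch):
    # "#" sorts above "$", and both sort above every vertex label.
    return {"$": 1 << 21, "#": (1 << 21) + 1}.get(ch, ord(ch))


def _key(s):
    return [_rank(ch) for ch in s]


def _walk(node):
    # Label, then each child subtree followed by one backtrack symbol.
    label, children = node
    return label + "".join(_walk(c) + "$" for c in children)


def encode(tree):
    """Depth-first string encoding of a rooted ordered tree."""
    # Trailing backtracks are dropped and "#" marks the end.
    return _walk(tree).rstrip("$") + "#"


def canonical_form(tree):
    """Reorder children recursively so that the encoding is minimal."""
    label, children = tree
    kids = [canonical_form(c) for c in children]
    kids.sort(key=lambda c: _key(_walk(c) + "$"))
    return (label, kids)


# Tree (a) of Figure 1, written from its encoding ABC$$BD$C#.
tree_a = ("A", [("B", [("C", [])]), ("B", [("D", []), ("C", [])])])
print(encode(tree_a))                  # ABC$$BD$C#
print(encode(canonical_form(tree_a)))  # ABC$D$$BC#
```

The second line printed is the encoding of tree (d), which the text identifies as the DFCS.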

For a rooted unordered tree in its DFCF, we define the rightmost leaf as 
the last vertex according to the depth-first traversal order, and rightmost path 
as the path from the root to the rightmost leaf. The rightmost path for the 
DFCF of the above example (Figure 1(d)) is the path in the shaded area and 
the rightmost leaf is the vertex with label C in the shaded area. 



3 Mining Closed and Maximal Frequent Subtrees 



Now, we describe our CMTreeMiner algorithm that mines both closed and max- 
imal frequent subtrees from a database of labeled rooted unordered trees. 



3.1 The Enumeration DAG and the Enumeration Tree 

We first define an enumeration DAG that enumerates all rooted unordered trees
in their DFCFs. The nodes of the enumeration DAG consist of all rooted
unordered trees in their DFCFs and the edges consist of all ordered pairs (t, t')
of rooted unordered trees such that t' ∈ B_t. Figure 2 shows a fraction of the
enumeration DAG. (For simplicity, we have only shown those trees with A as
the root.)

Next, we define a unique enumeration tree based on the enumeration DAG.
The enumeration tree is a spanning tree of the enumeration DAG, so the two
have the same set of nodes. The following theorem is key to the definition
of the enumeration tree.

Theorem 1. Removing the rightmost leaf from a rooted unordered (k+1)-tree
in its DFCF will result in the DFCF for another rooted unordered k-tree.¹

Based on the above theorem we can build an enumeration tree such that the 
parent for each rooted unordered tree is determined uniquely by removing the 
rightmost leaf from its DFCF. Figure 3 shows a fraction of the enumeration tree 
for the enumeration DAG in Figure 2. In order to grow the enumeration tree, 
starting from a node v of the enumeration tree, we need to find all valid children 
of v. Each child of v is obtained by adding a new vertex to v so that the new
vertex becomes the new rightmost leaf of the new DFCF. Therefore, the possible 
positions for adding the new rightmost leaf to a DFCF are the vertices on the 
rightmost path of the DFCF. 
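
The parent and child relations of the enumeration tree can be sketched as follows. The sketch assumes a tree is a (label, children) pair already in DFCF; it does not check whether an extension keeps the tree in DFCF (that requires the label range discussed in Section 3.2), and the function names are hypothetical.

```python
def rightmost_path(tree):
    """Labels on the path from the root to the rightmost leaf."""
    path = [tree]
    while path[-1][1]:                 # follow the last child downwards
        path.append(path[-1][1][-1])
    return [label for label, _ in path]


def remove_rightmost_leaf(tree):
    """Parent in the enumeration tree; None for a single-vertex tree."""
    label, children = tree
    if not children:
        return None
    last = remove_rightmost_leaf(children[-1])
    kept = children[:-1] + ([last] if last is not None else [])
    return (label, kept)


def extend(tree, depth, new_label):
    """Attach a new rightmost leaf below the vertex at `depth` on the
    rightmost path (depth 0 is the root). DFCF validity is NOT checked."""
    label, children = tree
    if depth == 0:
        return (label, children + [(new_label, [])])
    return (label, children[:-1] + [extend(children[-1], depth - 1, new_label)])


# Tree (d) of Figure 1, the DFCF with encoding ABC$D$$BC#.
tree_d = ("A", [("B", [("C", []), ("D", [])]), ("B", [("C", [])])])
print(rightmost_path(tree_d))          # ['A', 'B', 'C']
parent = remove_rightmost_leaf(tree_d)
print(extend(parent, 1, "C") == tree_d)  # True
```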



¹ Proofs for all the theorems in this paper are available in [6].

Fig. 2. The Enumeration DAG for Rooted Unordered Trees in DFCFs

Fig. 3. The Enumeration Tree for Rooted Unordered Trees in DFCFs



3.2 The CMTreeMiner Algorithm 

In the previous section, we have used the enumeration tree to enumerate all 
(frequent) subtrees in their DFCFs. However, the final goal of our algorithm is 
to find all closed and maximal frequent subtrees. As a result, it is not necessary 
to grow the complete enumeration tree, because under certain conditions, some 
branches of the enumeration tree are guaranteed to produce no closed or maximal 
frequent subtrees and therefore can be pruned. In this section, we introduce 
techniques that prune the unwanted branches with the help of the enumeration 
DAG (more specifically, the blankets). 

Let us look at a node v_t in the enumeration tree. We assume that v_t
corresponds to a frequent k-subtree t and denote the blanket of t as B_t. In
addition, we define three subsets of B_t:

B_t^F = {t' ∈ B_t | t' is frequent}

B_t^SM = {t' ∈ B_t | t' and t are support-matched}

B_t^OM = {t' ∈ B_t | t' and t are occurrence-matched}

From Property 2 we know that t is closed iff B_t^SM = ∅, that t is maximal
iff B_t^F = ∅, and that B_t^OM ⊆ B_t^SM. Therefore, by constructing B_t^SM and
B_t^F for t, we can know if t is closed and if t is maximal. However, there are two
problems. First, knowing that t is not closed does not automatically allow us
to prune v_t from the enumeration tree, because some descendants of v_t in the
enumeration tree might be closed. Second, computing B_t^F is time and space
consuming, because we have to record all members of B_t and their support. So
we want to avoid computing B_t^F whenever we can. In contrast, computing B_t^SM
and B_t^OM is not that difficult, because we only need to record the intersections
of all occurrences.

To solve the first problem mentioned above, we use B_t^OM, instead of B_t^SM,
to check if v_t can be pruned from the enumeration tree. For a t° ∈ B_t^OM (i.e.,
t° and t are occurrence-matched), the new vertex t°\t can occur at different
locations, as shown in Figure 4. In Case I of Figure 4, t°\t is the root of t°; in
Case II, t°\t is attached to a vertex of t that is not on the rightmost path; in
Case III and Case IV, t°\t is attached to a vertex on the rightmost path. The
difference between Case III and Case IV is whether or not t°\t can be the new
rightmost vertex of t°.

Fig. 4. Locations for an Additional Vertex to Be Added to a Subtree

Fig. 5. Computing the Range of a New Vertex for Extending a Subtree

To distinguish Case III and Case IV in Figure 4, we compute the range of
vertex labels that could possibly be the new rightmost vertex of a supertree in
B_t. Notice that this information is also important when we extend v_t in the
enumeration tree: we have to know what the valid children of v_t are. Figure 5
gives an example for computing the range of valid vertex labels at a given position
on the rightmost path. In the figure, if we add a new vertex at the given position,
we may violate the DFCF by changing the order between some ancestor of the
new vertex (including the vertex itself) and its immediate left sibling. So in
order to determine the range of allowable vertex labels for the new vertex (so
that adding the new vertex is guaranteed to result in a new DFCF), we can
check each vertex along the path from the new vertex to the root. In Figure 5,
the result of comparison (1) is that the new vertex should have a label greater
than or equal to A, comparison (2) tightens the label range to be greater than
or equal to B, and comparison (3) tightens the label range to be greater than or
equal to C. As a result, before we start adding new vertices, we know that adding
any vertex with label greater than or equal to C at that specific position will
surely result in a DFCF. Therefore, at this given location, adding a new vertex
with label greater than or equal to C will result in Case IV (and therefore the
new vertex becomes the new rightmost vertex), and adding a new vertex with
label less than C will result in Case III in Figure 4.

Now we propose a pruning technique based on B_t^OM, as given in the following
theorem.

Theorem 2. For a node v_t in the enumeration tree and the corresponding
subtree t, assume that t is frequent and B_t^OM ≠ ∅. If there exists a t° ∈ B_t^OM
such that t°\t is at the location of Case I, II, or III in Figure 4, then neither v_t
nor any descendant of v_t in the enumeration tree corresponds to a closed or
maximal frequent subtree; therefore v_t (together with all of its descendants)
can be pruned from the enumeration tree.

For the second problem mentioned above, in order to avoid computing B_t^F as
much as possible, we compute B_t^OM and B_t^SM first. If some t° ∈ B_t^OM is of Case
I, II, or III in Figure 4, we are lucky because v_t (and all its descendants) can
be pruned completely from the enumeration tree. Even if this is not the case, as
long as B_t^SM ≠ ∅, we only have to do the regular extension to the enumeration
tree, with the knowledge that t cannot be maximal. To extend v_t, we find all
potential children of v_t by checking the potential new rightmost leaves within the
range that we have computed as described above. Only when B_t^SM = ∅, before
doing the regular extension to the enumeration tree, do we have to compute B_t^F
to check if t is maximal. Putting all the above discussion together, Figure 6 gives
our CMTreeMiner algorithm.



Algorithm CMTreeMiner(D, minsup)
    CL ← ∅, MX ← ∅;
    C ← frequent 1-trees;
    CM-Grow(C, CL, MX, minsup);
    return CL, MX;

Algorithm CM-Grow(C, CL, MX, minsup)
    for i ← 1, ..., |C| do
        E ← ∅;
        compute B_{c_i}^OM, B_{c_i}^SM;
        if ∃ c° ∈ B_{c_i}^OM that is of Case I, II, or III then continue;
        if B_{c_i}^SM = ∅ then
            CL ← CL ∪ c_i;
            compute B_{c_i}^F;
            if B_{c_i}^F = ∅ then MX ← MX ∪ c_i;
        for each vertex v_n on the rightmost path of c_i do
            for each valid new rightmost vertex v_m of c_i do
                e ← c_i plus vertex v_m, with v_n as v_m’s parent;
                if supp(e) ≥ minsup then E ← E ∪ e;
        if E ≠ ∅ then CM-Grow(E, CL, MX, minsup);
    return;



Fig. 6. The CMTreeMiner Algorithm
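
The control flow of the algorithm can be sketched in a few lines. This is only a skeleton under stated assumptions: every tree operation (computing the blanket subsets, classifying the case of an occurrence-matched supertree, generating rightmost extensions, counting support) is delegated to a hypothetical `ops` object whose method names are invented for illustration, and `ToyOps` returns hand-written answers for a made-up search space rather than mining a real database.

```python
def cm_tree_miner(frequent_1_trees, minsup, ops):
    """Top-level driver: returns (closed, maximal) frequent subtrees."""
    closed, maximal = [], []
    cm_grow(frequent_1_trees, closed, maximal, minsup, ops)
    return closed, maximal


def cm_grow(candidates, closed, maximal, minsup, ops):
    """Control flow of CM-Grow; tree operations are delegated to `ops`."""
    for c in candidates:
        om = ops.occurrence_matched(c)  # occurrence-matched blanket subset
        sm = ops.support_matched(c)     # support-matched blanket subset
        # Theorem 2: prune c together with its whole branch.
        if any(ops.case_of(c, t) in ("I", "II", "III") for t in om):
            continue
        if not sm:                      # empty: c is closed
            closed.append(c)
            if not ops.frequent_blanket(c, minsup):  # empty: c is maximal
                maximal.append(c)
        extensions = [e for e in ops.rightmost_extensions(c)
                      if ops.support(e) >= minsup]
        if extensions:
            cm_grow(extensions, closed, maximal, minsup, ops)


class ToyOps:
    """Hand-written answers for a tiny made-up search space."""
    om = {"A": ["XA"], "B": ["BC"]}
    sm = {"A": ["XA"], "B": ["BC"]}
    case = {("A", "XA"): "I", ("B", "BC"): "IV"}
    ext = {"B": ["BC"]}
    supp = {"BC": 2}

    def occurrence_matched(self, t): return self.om.get(t, [])
    def support_matched(self, t): return self.sm.get(t, [])
    def case_of(self, t, sup): return self.case[(t, sup)]
    def frequent_blanket(self, t, minsup): return []
    def rightmost_extensions(self, t): return self.ext.get(t, [])
    def support(self, t): return self.supp[t]


print(cm_tree_miner(["A", "B"], 2, ToyOps()))  # (['BC'], ['BC'])
```

In the toy run, "A" is pruned by a Case I supertree, "B" is not closed but is still extended, and its extension "BC" is reported as both closed and maximal.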



We want to point out two possible variations to the CMTreeMiner algorithm.
First, the algorithm mines both closed frequent subtrees and maximal frequent
subtrees at the same time. However, the algorithm can be easily changed to mine
only closed frequent subtrees or only maximal frequent subtrees. For mining only
closed frequent subtrees, we just skip the step of computing B_t^F. For mining only
maximal frequent subtrees, we just skip computing B_t^SM and use B_t^OM to prune
the subtrees that are not maximal. Notice that this pruning is indirect: B_t^OM
only prunes the subtrees that are not closed, but if a subtree is not closed then it
cannot be maximal. If B_t^OM = ∅, for better pruning effects, we can still compute
B_t^SM to determine if we want to compute B_t^F. If this is the case, although we
only want the maximal frequent subtrees, the closed frequent subtrees are
byproducts of the algorithm. For the second variation, although our enumeration
tree is built for enumerating all rooted unordered subtrees, it can be changed
easily to enumerate all rooted ordered subtrees: the rightmost expansion is still
valid for rooted ordered subtrees and we only have to remove the canonical
form restriction. Therefore, our CMTreeMiner algorithm can handle databases
of rooted ordered trees as well.

4 Experiments 

We performed extensive experiments to evaluate the performance of the
CMTreeMiner algorithm using both synthetic datasets and datasets from real
applications. Due to space limitations, here we only report the results for a
synthetic dataset. We refer interested readers to [6] for other results. All
experiments were done on a 2GHz Intel Pentium IV PC with 1GB main memory,
running the RedHat Linux 7.3 operating system. All algorithms were implemented
in C++ and compiled using the g++ 2.96 compiler.

As far as we know, PathJoin [8] is the only algorithm for mining maximal
frequent subtrees. PathJoin uses a subsequent pruning that, after obtaining all
frequent subtrees, prunes those frequent subtrees that are not maximal. Because
PathJoin uses the paths from roots to leaves to help subtree mining, it does not
allow any siblings in a tree to have the same labels. Therefore, we have generated
a dataset that meets this special requirement. We have used the data generator
given in [5] to generate synthetic data. The detailed procedure for generating
the dataset is described in [5] and here we give a very brief description. A set
of |N| subtrees are sampled from a large base (labeled) graph. We call this set
of |N| subtrees the seed trees. Each seed tree is the starting point for |D| · |S|
transactions, where |D| is the number of transactions in the database and |S| is
the minimum support. Each of these |D| · |S| transactions is obtained by first
randomly permuting the seed tree and then adding more random vertices to increase
the size of the transaction to |T|. After this step, more random transactions with
size |T| are added to the database to increase the cardinality of the database to
|D|. The number of distinct vertex labels is controlled by the parameter |L|. The
parameters for the dataset used in this experiment are: |D|=100000, |N|=90,
|L|=1000, |S|=1%, |T|=|I|, and |I| varies from 5 to 50. (For |I| > 25, PathJoin
exhausts all available memory.)

Figure 7 compares the performance of PathJoin with that of CMTreeMiner.
Figure 7(a) gives the number of all frequent subtrees obtained by PathJoin, the
number of subtrees checked by CMTreeMiner, the number of closed frequent
subtrees, and the number of maximal frequent subtrees. As the figure shows, the
number of subtrees checked by CMTreeMiner and the number of closed subtrees
grow in polynomial fashion. In contrast, the total number of all frequent subtrees
(which is a lower bound of the number of subtrees checked by PathJoin) grows
exponentially. As a result, as demonstrated in Figure 7(b), although PathJoin is
very efficient for datasets with small tree sizes, as the tree size increases beyond
some reasonably large value (say 10), it becomes obvious that PathJoin suffers from
exponential explosion while CMTreeMiner does not. (Notice the logarithmic
scale of the figure.) For example, with the size of the maximal frequent subtrees
set to 25 in the dataset, it took PathJoin around 3 days to find all maximal
frequent subtrees while it took CMTreeMiner only 90 seconds!

Fig. 7. CMTreeMiner vs. PathJoin (panels (a) and (b))



5 Conclusion 

In this paper, we have studied the issue of mining frequent subtrees from
databases of labeled rooted unordered trees. We have presented a new efficient
algorithm that mines both closed and maximal frequent subtrees. The algorithm is
built on a canonical form that we defined in our previous work. Based
on the canonical form, an enumeration tree is defined to enumerate all subtrees,
and an enumeration DAG is used for pruning branches of the enumeration tree
that will not result in closed or maximal frequent subtrees. The experiments
showed that our new algorithm performs in polynomial fashion instead of the
exponential growth shown by other algorithms.



Acknowledgements. Thanks to Professor Y. Xiao at the Georgia College and
State University for providing the PathJoin source code and offering a lot of
help.



Abstract. The sharing of association rules is often beneficial in
industry, but requires privacy safeguards. One may decide to disclose 
only part of the knowledge and conceal strategic patterns which we 
call restrictive rules. These restrictive rules must be protected before 
sharing since they are paramount for strategic decisions and need to 
remain private. To address this challenging problem, we propose a 
unified framework for protecting sensitive knowledge before sharing. 
This framework encompasses: (a) an algorithm that sanitizes restrictive 
rules, while blocking some inference channels. We validate our algorithm 
against real and synthetic datasets; (b) a set of metrics to evaluate 
attacks against sensitive knowledge and the impact of the sanitization. 
We also introduce a taxonomy of sanitizing algorithms and a taxonomy 
of attacks against sensitive knowledge. 

Keywords: Privacy preserving data mining, Protecting sensitive knowledge,
Sharing association rules, Data sanitization, Sanitizing algorithms.



1 Introduction 

Protecting against inference learning in the data mining sense has begun to receive
attention. In particular, the problem of privacy preservation when sharing data
while wanting to conceal some restrictive associations has been addressed in the
literature [1, 2, 4, 5, 7]. The proposed solutions consist of transforming a
transactional database to be shared in such a way that the restrictive rules cannot be
discovered. This process is called data sanitization [1]. The effectiveness of the
data sanitization is measured by the proportion of restrictive rules effectively 
hidden (hiding failure), the proportion of rules accidentally hidden (misses cost) 
and the amount of artifactual rules created by the process [4]. The problem we 
address here is different and more practical. It is the problem of rule sanitization. 
Rather than sharing the data, collaborators prefer to mine their own data and 
share the discovered patterns. 

Let us consider a motivating example based on a case discussed in [3]. Suppose
we have a server and many clients, in which each client has a set of sold items
(e.g., books, movies, etc.). The clients want the server to gather statistical
information about associations among items in order to provide recommendations to
the clients. However, the clients do not want the server to know some restrictive
association rules. In this context, the clients represent companies and the server
is a recommendation system for an e-commerce application, for example, the fruit
of the clients' collaboration. In the absence of ratings, which are used in
collaborative filtering for automatic recommendation building, association rules can
be effectively used to build models for on-line recommendation. When a client
sends its frequent itemsets or association rules to the server, it sanitizes some
restrictive itemsets according to some specific policies. The server then gathers
statistical information from the sanitized itemsets and recovers from them the
actual associations.

The simplistic solution to address the motivating example is to implement 
a filter after the mining phase to weed out/hide the restricted discovered rules. 
However, we claim that trimming some rules out does not ensure full protection. 
The sanitization applied to the set of rules must not leave a trace that could 
be exploited by an adversary. We must guarantee that some inference channels 
have been blocked as well. 

This paper introduces the notion of rule sanitization. The main contribution 
of this paper is a novel framework for protecting sensitive knowledge before shar- 
ing association rules. This framework encompasses: (a) a sanitizing algorithm 
called Downright Sanitizing Algorithm (DSA). This algorithm sanitizes a set of 
restrictive rules while blocking some inference channels; (b) a set of metrics to 
evaluate attacks against sensitive knowledge and the impact of the sanitization. 
Another contribution is a taxonomy of existing sanitizing algorithms. Finally, we 
present a taxonomy of attacks against sensitive knowledge. To the best of our
knowledge, the investigation of attacks against sensitive knowledge, notably in the
context of data or rule sanitization, has not been explored in any detail.

This paper is organized as follows. Related work is reviewed in Section 2. The 
problem definition is stated in Section 3. In Section 4, we present our framework 
for protecting sensitive knowledge. In Section 5, we introduce our Downright 
Sanitizing Algorithm (DSA). The experimental results and discussion are pre- 
sented in Section 6. Finally, Section 7 presents our conclusions and a discussion 
of future work. 



2 Related Work 

Some effort has been made to address the problem of protecting sensitive
knowledge in association rule mining by data sanitization. The existing sanitizing
algorithms can be classified into two major classes: the Data-Sharing approach and
the Pattern-Sharing approach, as can be seen in Figure 1A. In the former, the
sanitization process acts on the data to remove or hide the group of restrictive
association rules that contain sensitive knowledge. To do so, a small number of
transactions that contain the restrictive rules have to be modified by deleting
one or more items from them or even adding some noise, i.e., new items not
originally present in such transactions. In the latter, the sanitizing algorithm
acts on the rules mined from a database, instead of the data itself. The only
known algorithm in this category is our DSA algorithm herein presented. The
algorithm removes all restrictive rules before the sharing process.

Among the algorithms of the Data-Sharing approach, we distinguish the following
categories: Item Restriction-Based, Item Addition-Based, and Item
Obfuscation-Based.

Item Restriction-Based: These algorithms [2] remove one or more items
from a group of transactions containing restrictive rules. In doing so, the
algorithms hide restrictive rules by reducing either their support or confidence below
a privacy threshold. Other algorithms [4,5,6] in this category hide rules
by satisfying a disclosure threshold ψ controlled by the database owner. This
threshold basically expresses how relaxed the privacy preserving mechanisms
should be. When ψ = 0%, no restrictive association rules are allowed to be
discovered. When ψ = 100%, there are no restrictions on the restrictive association
rules.

Item Addition-Based: Unlike the previous algorithms, item addition-based
algorithms modify existing information in transactional databases by adding
some items not originally present in some transactions. The items are added to
the antecedent part X of a rule X → Y in transactions that partially support
it. In doing so, the confidence of such a rule is decreased. This approach [2] may
generate artifacts such as artificial association rules that would not exist in the
original database.

Item Obfuscation-Based: These algorithms [7] hide rules by placing a
mark “?” (unknown) in items of some transactions containing restrictive rules,
instead of deleting such items. In doing so, these algorithms obscure a given
set of restrictive rules by replacing known values with unknowns. Like the item
restriction-based algorithms, these algorithms reduce the impact on the sanitized
databases, protecting miners from learning “false” rules.

The work presented here differs from the related work in some aspects, as 
follows: first, our algorithm addresses the issue of pattern sharing and sanitizes 
rules, not transactions. Second, we study attacks against sensitive knowledge in 
the context of rule sanitization. This line of work has not been considered so far. 
Most importantly, our contribution in rule sanitization and the existing solutions 
in data sanitization are complementary. 



3 Problem Definition 

The specific problem addressed in this paper can be stated as follows: Let D be
a database, R be the set of rules mined from D based on a minimum support
threshold σ, and R_R be a set of restrictive rules that must be protected according
to some security/privacy policies. The goal is to transform R into R', where R'
represents the set of non-restrictive rules. In this case, R' becomes the released
set of rules that is made available for sharing.



Fig. 1. (A): A Taxonomy of Sanitizing Algorithms. (B): Rule Sanitization problems.



Ideally, R' = R − R_R. However, there could be a set of rules r in R' from
which one could derive or infer a restrictive rule in R_R. So in reality, R' =
R − (R_R + R_SE), where R_SE is the set of non-restrictive rules that are removed
as a side effect of the sanitization process to avoid recovery of R_R.
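
The relation between the rule sets can be written directly with set operations. The sketch below is only an illustration: rules are represented as plain strings, the concrete rules are made up, and which rules end up in R_SE is assumed to be decided elsewhere by the sanitizing algorithm.

```python
def released_rules(rules, restrictive, side_effect):
    """R' = R - (R_R + R_SE), with '+' read as set union."""
    return rules - (restrictive | side_effect)


# Hypothetical rule sets, for illustration only.
R = {"A->B", "A->C", "B->C", "AB->C"}
R_R = {"AB->C"}   # restrictive rules to hide
R_SE = {"A->C"}   # non-restrictive rules removed as a side effect
print(sorted(released_rules(R, R_R, R_SE)))  # ['A->B', 'B->C']
```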

Figure 1B illustrates the problems that occur during the rule sanitization
process. Problem 1 conveys the non-restrictive rules that are removed as a side
effect of the process (R_SE). We refer to this problem as side effect. It is related
to the misses cost problem in data sanitization [4]. Problem 2 occurs when, using
some non-restrictive rules, an adversary recovers some restrictive ones through
inference channels. We refer to such a problem as recovery factor.



4 Framework for Protecting Sensitive Knowledge 

Before introducing our framework for protecting sensitive knowledge, we briefly 
review some terminology from graph theory. We present our new sanitizing al- 
gorithm in Section 5.2. 



4.1 Basic Definitions 

The itemsets in a database can be represented in terms of a directed graph. We 
refer to such a graph as frequent itemset graph and define it as follows: 

Definition 1 (Frequent Itemset Graph). A frequent itemset graph, denoted
by G = (C, E), is a directed graph which consists of a nonempty set of frequent
itemsets C and a set of edges E that are ordered pairings of the elements of C, such
that ∀u, v ∈ C there is an edge from u to v if u ∩ v = u and if |v| − |u| = 1, where
|x| is the size of itemset x.
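
Definition 1 translates almost literally into code. The following sketch assumes single-character item names, so that an itemset can be written as a short string, and the function name is hypothetical; the example itemsets are made up.

```python
def frequent_itemset_graph(itemsets):
    """Build G = (C, E) from a collection of frequent itemsets."""
    nodes = {frozenset(s) for s in itemsets}
    # Edge u -> v whenever u is contained in v and v has one more item.
    edges = {(u, v) for u in nodes for v in nodes
             if u & v == u and len(v) - len(u) == 1}
    return nodes, edges


C, E = frequent_itemset_graph(["A", "B", "C", "AB", "AC", "BC", "ABC"])
print(len(C), len(E))  # 7 9
```

The nine edges are the six from 1-itemsets to 2-itemsets and the three from 2-itemsets to ABC.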

Fig. 2. (a) A transactional database. (b) The corresponding frequent itemset graph.

Figure 2b shows a frequent itemset graph for the sample transactional database
depicted in Figure 2a. In this example, the minimum support threshold σ is set
to 2. As can be seen in Figure 2b, in a frequent itemset graph G, there is an
ordering for each itemset. We refer to such an ordering as itemset level and define
it as follows:



Definition 2 (Itemset Level). Let G = (C, E) be a frequent itemset graph.
The level of an itemset u, such that u ∈ C, is the length of the path connecting
a 1-itemset to u.

Based on Definition 2, we define the level of a frequent itemset graph G as 
follows: 

Definition 3 (Frequent Itemset Graph Level). Let G = (C, E) be a frequent
itemset graph. The level of G is the length of the maximum path connecting
a 1-itemset u to any other itemset v, such that u, v ∈ C, and u ⊂ v.

In general, the discovery of itemsets in $G$ is the result of a top-down traversal of
$G$ constrained by a minimum support threshold $\sigma$. The discovery process employs
an iterative approach in which $k$-itemsets are used to explore $(k+1)$-itemsets.
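
The iterative exploration from $k$-itemsets to $(k+1)$-itemsets can be sketched as follows. This is a simplified Apriori-style loop under our own assumptions (an absolute support count, a hypothetical toy database, and the function name `levelwise_itemsets`); index `k` of the returned list corresponds to level `k` of the graph, so level 0 holds the 1-itemsets.

```python
from itertools import combinations


def levelwise_itemsets(transactions, min_support):
    """Return a list of levels; levels[k] holds the frequent (k+1)-itemsets."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    levels = []
    while current:
        levels.append(current)
        size = len(next(iter(current))) + 1
        # Join step: unions of two frequent itemsets that have exactly `size` items.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: keep a candidate only if all of its subsets one level up
        # are frequent and the candidate itself reaches the support threshold.
        current = {c for c in candidates
                   if all(frozenset(s) in current
                          for s in combinations(c, size - 1))
                   and support(c) >= min_support}
    return levels


# Hypothetical toy database used only for this sketch.
toy_db = [frozenset(t) for t in ("ABCD", "ABC", "ABD", "ACD", "AB")]
levels = levelwise_itemsets(toy_db, min_support=2)
graph_level = len(levels) - 1  # level of G in the sense of Definition 3
```

In this toy run the deepest frequent itemsets have three items, so the graph level is 2.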

4.2 Taxonomy of Attacks 

An attack occurs when someone mines a sanitized set of rules and, based on
non-restrictive rules, deduces one or more restrictive rules that are not supposed to be
discovered. We have identified the following attacks against sanitized rules:

Forward-Inference Attack: Let us consider the frequent itemset graph in 
Figure 3A. Suppose we want to sanitize the restrictive rules derived from 
the itemset ABC. The naive approach is simply to remove the itemset ABC. 
However, if AB, AC, and BC are frequent, a miner could deduce that ABC 
is frequent. A database owner must assume that an adversary can use any 
inference channel to learn something more than just the permitted association
rules. We refer to this attack as the forward-inference attack. To handle
this attack, we must also remove at least one subset of ABC at level 1
of the frequent itemset graph. This complementary sanitization is necessary.
In the case of a deeper graph, the removal is done recursively up to level 1.
We start removing from level 1 because we assume that the association rules 
recovered from the itemsets have at least 2 items. Thus, the items in level 0 
of the frequent itemset graph are not shared with a second party. In doing 
so, we reduce the inference channels and minimize the side effect.

Backward-Inference Attack: Another type of attack occurs when we sanitize
a non-terminal itemset. Based on Figure 3B, suppose we want to sanitize any
rule derived from the itemset AC. If we simply remove AC, it is straightforward
to infer the rules mined from AC since either ABC or ACD is frequent.
We refer to this attack as the backward-inference attack. To block this attack,
we must remove any superset that contains AC. In this particular case, ABC
and ACD must be removed as well.
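
Both attacks can be phrased as simple checks over the set of itemsets that is actually shared. The sketch below is only an illustration; the function names and the example sets are assumptions and do not reproduce Figure 3.

```python
from itertools import combinations


def forward_inferable(target, shared):
    """True if every subset of `target` with one item less is still shared."""
    target = frozenset(target)
    return all(frozenset(s) in shared
               for s in combinations(target, len(target) - 1))


def backward_inferable(target, shared):
    """True if some shared itemset is a proper superset of `target`."""
    target = frozenset(target)
    return any(target < s for s in shared)


# Forward inference: ABC was removed, but AB, AC and BC are still shared.
shared_fwd = {frozenset(s) for s in ("A", "B", "C", "AB", "AC", "BC")}
leaks_abc = forward_inferable("ABC", shared_fwd)
blocked_abc = forward_inferable("ABC", shared_fwd - {frozenset("BC")})

# Backward inference: AC was removed, but its supersets ABC and ACD remain.
shared_bwd = {frozenset(s) for s in ("AB", "BC", "AD", "CD", "ABC", "ACD")}
leaks_ac = backward_inferable("AC", shared_bwd)
```

The forward check only says that recovery is possible, not that it is certain, since an adversary still has to guess that the missing itemset was frequent.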




Fig. 3. (a) An example of forward-inference, (b) An example of backward-inference. 



4.3 Metrics 

In this section, we introduce two metrics related to the problems illustrated in
Figure 1B: the side effect factor and the recovery factor.

Side Effect Factor (SEF): Measures the amount of non-restrictive association
rules that are removed as a side effect of the sanitization process. The
side effect factor is calculated as follows: $SEF = \frac{|R| - (|R'| + |R_R|)}{|R| - |R_R|}$, where $R$,
$R'$, and $R_R$ represent the set of rules mined from a database, the set of sanitized
rules, and the set of restrictive rules, respectively, and $|S|$ is the size
of the set $S$.

Recovery Factor (RF): This measure expresses the possibility of an adversary
recovering a restrictive rule based on non-restrictive ones. The recovery
factor of a pattern takes into account the existence of its subsets. The
rationale is that all nonempty subsets of a frequent itemset
must be frequent. Thus, if all subsets of a restrictive itemset
(rule) can be recovered, we say that the recovery of such an itemset is possible, and
we assign the recovery factor the value 1. However, the recovery is never certain, i.e.,
an adversary may not learn an itemset even when given its subsets. On the other
hand, when not all subsets of an itemset are present, the recovery of the
itemset is improbable, and we assign the value 0 to the recovery factor.
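
Both metrics can be computed directly from sets. The sketch below assumes that the side effect factor is the fraction of non-restrictive rules removed by the sanitization, i.e., $(|R| - (|R'| + |R_R|)) / (|R| - |R_R|)$, and that only subsets with at least two items are checked for the recovery factor, since 1-itemsets are not shared. The function names, the `min_size` parameter, and the treatment of itemsets without such subsets (recovery factor 0) are our assumptions.

```python
from itertools import combinations


def side_effect_factor(rules, sanitized_rules, restrictive_rules):
    """Fraction of non-restrictive rules lost as a side effect of sanitization."""
    non_restrictive = len(rules) - len(restrictive_rules)
    removed = len(rules) - (len(sanitized_rules) + len(restrictive_rules))
    return removed / non_restrictive


def recovery_factor(itemset, released, min_size=2):
    """1 if every proper subset with at least `min_size` items is released, else 0."""
    itemset = frozenset(itemset)
    subsets = [frozenset(s)
               for k in range(min_size, len(itemset))
               for s in combinations(itemset, k)]
    # Assumption: an itemset with no qualifying subsets gets recovery factor 0.
    return 1 if subsets and all(s in released for s in subsets) else 0


# Hypothetical numbers: 100 mined rules, 10 restrictive, 85 left after sanitization.
R = set(range(100))
R_R = set(range(10))
R_prime = set(range(15, 100))
sef = side_effect_factor(R, R_prime, R_R)  # 5 of the 90 non-restrictive rules are lost

released = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
rf = recovery_factor("ABC", released)
```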

5 Rule Sanitization: The DSA Algorithm

5.1 The General Approach 

Our approach has essentially three steps, described below. These steps are applied
after the mining phase, i.e., we assume that the frequent itemset graph $G$ is
built. The set of all itemsets that can be mined from $G$, based on a minimum
support threshold $\sigma$, is denoted by $C$.

Step 1: Identifying the restrictive itemsets. For each restrictive rule in $R_R$,
convert it to an itemset $c_i \in G$ and mark it to be sanitized.

Step 2: Selecting subsets to sanitize. In this step, for each itemset $c_i$ to be
sanitized, we compute its item pairs from level 1 in $G$, which are subsets of $c_i$. If none of
them is marked, we randomly select one of them and mark it to be removed.

Step 3: Sanitizing the set of supersets of marked pairs in level 1. The
sanitization of restrictive itemsets is simply the removal of the set of
supersets of all itemsets in level 1 of $G$ that are marked for removal. This
process blocks inference channels.



5.2 The Downright Sanitizing Algorithm 

In this section, we introduce the Downright Sanitizing Algorithm, denoted by
DSA. The inputs for DSA are the frequent itemset graph $G$, the set of all
association rules $R$ mined from a database $D$, and the set of restrictive rules
$R_R$ to be sanitized. The output is the set of sanitized association rules $R'$.

Downright_Sanitizing_Algorithm
Input: G, R, R_R
Output: R'

Step 1. For each association rule rr_i ∈ R_R do
    1.1. pattern_i ← rr_i    // Convert each rr_i into a frequent itemset pattern_i

Step 2. For each pattern_i in the level k of G, where k > 1 do {
    2.1. Pairs(pattern_i)    // Compute all the item pairs of pattern_i
    2.2. If (Pairs(pattern_i) ∩ MarkedPair = ∅) then {
        2.2.1. p_i ← random(Pairs(pattern_i))    // Select randomly a pair p_i ⊂ pattern_i
        2.2.2. MarkedPair ← MarkedPair ∪ {p_i}    // Update the list MarkedPair
    }
}

Step 3. R' ← R
    3.1. For each itemset c_i ∈ G do {
        3.1.1. If ∃ a marked pair p, such that p ∈ MarkedPair and p ⊆ c_i then
            3.1.1.1. Remove(c_i) from R'    // c_i belongs to the set of supersets of p
    }

End Algorithm
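
The three steps can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: rules are represented as `(antecedent, consequent)` pairs of item collections, a restrictive itemset with only two items is treated as its own pair, and the function name `dsa` and the `rng` parameter are hypothetical.

```python
import random
from itertools import combinations


def dsa(rules, restrictive_rules, rng=None):
    """Return (sanitized rules, marked pairs) following the three DSA steps."""
    rng = rng or random.Random()
    marked = set()
    # Steps 1 and 2: turn each restrictive rule into an itemset and make sure
    # that at least one of its item pairs is marked for removal.
    for antecedent, consequent in restrictive_rules:
        pattern = frozenset(antecedent) | frozenset(consequent)
        pairs = [frozenset(p) for p in combinations(sorted(pattern), 2)]
        if not any(p in marked for p in pairs):
            marked.add(rng.choice(pairs))
    # Step 3: drop every rule whose itemset contains a marked pair.
    sanitized = [(a, c) for a, c in rules
                 if not any(p <= (frozenset(a) | frozenset(c)) for p in marked)]
    return sanitized, marked


# Hypothetical rules over single-character items; ("AB", "C") stands for AB -> C.
rules = [("A", "B"), ("A", "C"), ("B", "C"), ("AB", "C"), ("A", "D"), ("AC", "D")]
sanitized, marked = dsa(rules, [("AB", "C")], rng=random.Random(1))
```

Because the marked pair is removed together with all of its supersets, at least one subset of the restrictive itemset is missing from the released rules (no forward inference), and no superset of the marked pair survives (no backward inference through it).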

6 Experimental Results

In this section, we study the efficiency and scalability of DSA and of similar
counterparts in the literature. There are no known algorithms for rule sanitization.
However, transaction sanitization algorithms can be used for this purpose.
Indeed, in order to sanitize a set of rules $R$ to hide $R_R$, one can use data sanitization
to transform the database $D$ into $D'$ to hide $R_R$ and then mine $D'$ to get
the rules to share. We used this idea to compare our algorithm to existing data
sanitization approaches.



6.1 Datasets 

We validated DSA against real and synthetic datasets. The real dataset,
BMS-Web-View-2 [8], was placed in the public domain by Blue Martini Software. The
dataset contains 22,112 transactions with 2717 distinct items, and each customer
purchase has four items on average. The synthetic dataset was generated with
the IBM synthetic data generator. This dataset contains 100,000 transactions
with 500 different items, and each customer purchase has ten items on average.



6.2 Sanitizing Algorithms 

For our comparison study, we selected the best sanitizing algorithms in the literature:
(1) Algo2a, which hides restrictive rules by reducing support [2]; (2) the Item Grouping
Algorithm (IGA) [4], which groups restricted association rules in clusters of
rules sharing the same itemsets. The shared items are removed to reduce the
impact on the sanitized dataset; (3) the Sliding Window Algorithm (SWA) [6], which scans
a group of K transactions at a time and sanitizes the restrictive rules present
in such transactions based on a disclosure threshold $\psi$ defined by a database
owner. We set the window size of SWA to 20000 transactions in both datasets;
(4) an algorithm similar to DSA, called Naive, which sanitizes restrictive itemsets
and their supersets, i.e., Naive blocks the forward-inference attack without
considering blocking the backward-inference attack.



6.3 Methodology 

We performed two series of experiments: the first to evaluate the effectiveness of 
DSA, and the second to measure its efficiency and scalability. 

Our comparison study was carried out through the following steps: (1) we
used the algorithms IGA, SWA, and Algo2a to sanitize both initial databases;
(2) we applied the Apriori algorithm on the sanitized data to extract the rules
to share. For DSA, two steps were also necessary: (1) apply the Apriori algorithm
to extract rules from the two initial datasets; (2) use DSA to sanitize these
rules. The effectiveness is measured in terms of restrictive associations that can
be recovered by an adversary, as well as the proportion of non-restrictive rules
hidden due to the sanitization.

All the experiments were conducted on a PC, AMD Athlon 1900/1600 (SPEC
CFP2000 588), with 1.2 GB of RAM running a Linux operating system. In our
experiments, we selected a set of 20 restrictive association rules for the real
dataset and 30 for the synthetic dataset, with rules ranging from 2 to 5 items.
The real dataset has 17,667 association rules with support > 0.2% and confidence
> 50%, while the synthetic dataset has 20,823 rules with support > 0.1% and
confidence > 50%.

6.4 Measuring Effectiveness 

In this section, we measure the effectiveness of DSA, IGA, SWA, and Algo2a 
considering the metrics introduced in Section 4.3. 

In order to compare the algorithms under the same conditions, we set the disclosure
thresholds $\psi$ of the algorithms IGA and SWA, and the privacy threshold
A of algorithm Algo2a to 0%. In this case, all restrictive rules are completely sanitized.
We purposely set these thresholds to zero because DSA always sanitizes
all the restrictive rules. However, the value for the side effect factor differs from 
one algorithm to another. For instance, Figure 4A shows the side effect factor on 
the synthetic dataset. The lower the result the better. For this example, 1.09% 
of the non-restrictive association rules in the case of Naive, 3.58% in the case of 
DSA, 6.48% in the case of IGA, 6.94% in the case of SWA, and 8.12% in the 
case of Algo2a are removed by the sanitization process. 

Similarly, Figure 4B shows the side effect of the sanitization on the real 
dataset. In this situation, 3.2% of the non-restrictive association rules in the 
case of Naive, 4.35% in the case of DSA, 11.3% in the case of IGA, 22.1% in the 
case of SWA, and 27.8% in the case of Algo2a are removed. 

In both cases, Naive yielded the best results, but we still need to evaluate
how effective Naive is at blocking inference channels. We do so below. DSA also
yielded promising results, while the sanitization performed by Algo2a impacts
the database more significantly. An important observation here concerns the results
yielded by SWA and IGA. Both algorithms benefit from shared items in the
restrictive rules during the process of sanitization. By sanitizing the shared items
of these restrictive rules, one would take care of hiding such restrictive rules in
one step. As a result, the impact on the non-restrictive rules is minimized. In
general, the heuristic of IGA is more efficient than that of SWA. This explains
the better performance of IGA over SWA in both datasets.

After identifying the side effect factor, we evaluated the recovery factor for 
Naive and DSA. This measure is not applied to IGA, SWA, and Algo2a since 
these algorithms rely on data sanitization instead of rule sanitization. Thus, once 
the data is shared for mining, there is no restriction about the rules discovered 
from a sanitized database. 

In the case of rule sanitization, some inference channels can occur, as discussed
in Section 4.2. We ran a checklist procedure to evaluate how effective the
sanitization performed by Naive and DSA is. We check the existence of any subset
of the restrictive rules removed in order to identify the recovery factor. If all
subsets of a rule are found, we assume the rule could be recovered. As expected,
Naive blocked the forward-inference attacks well but failed to block
backward-inference attacks in both datasets. In contrast, DSA yielded the best results
in all cases, i.e., DSA blocked both the forward-inference and the backward-inference
attacks. The results suggest that an adversary can hardly reconstruct the
restrictive rules after the sanitization performed by DSA.

Fig. 4. (A): Side effect on the synthetic dataset. (B): Side effect on the real dataset.

6.5 CPU Time for the Sanitization Process 

We tested the scalability of DSA and the other algorithms vis-a-vis the number
of rules to hide. We did not plot Naive because its CPU time is very similar
to that of DSA. We varied the number of restrictive rules to hide from 20
to 100 and set the disclosure thresholds to $\psi$ = 0%. The rules were randomly
selected from both datasets. Figures 5A and 5B show that the algorithms scale
well with the number of rules to hide. Note that the CPU time of IGA, SWA, and DSA
increases linearly, while the CPU time of Algo2a grows fast. This is due to the
fact that Algo2a requires various scans over the original database, while IGA
requires two, and both DSA and SWA require only one.

Although IGA requires two scans, it is faster than SWA in most cases. The
main reason is that SWA performs a number of operations in main memory
to fully sanitize a database. IGA requires one scan to build an inverted index
where the vocabulary contains the restrictive rules and the occurrences contain
the transaction IDs. In the second scan, IGA sanitizes only the transactions
marked in the occurrences. Another important result is that IGA and DSA
yielded very similar CPU times for both datasets. In particular, IGA was better
on the synthetic dataset because the transactions contain more items and IGA
requires fewer operations in main memory.



6.6 General Discussion 

Our experiments demonstrated evidence of attacks in sanitized databases.
The figures revealed that DSA is a promising solution to protect sensitive knowledge
before sharing association rules, notably in the context of rule sanitization.
DSA has a low value for the side effect factor and a very low recovery factor. We
have identified some advantages of DSA over the previous data sanitizing algorithms
in the literature, as follows: (1) using DSA, a database owner would
share patterns (only rules) instead of the data itself; (2) by sanitizing rules, one
drastically reduces the possibility of inference channels, since the support threshold
and the mining algorithm are selected beforehand by the database owner; and
(3) sanitizing rules instead of data results in no alteration in the support and
confidence of the non-restrictive rules, i.e., the released rules have the original
support and confidence. As a result, the released rules seem more interesting
for practical applications. Note that the other approaches reduce support and
confidence of the rules as a side effect of the sanitization process.

Fig. 5. (A): CPU time for the synthetic dataset. (B): CPU time for the real dataset.

On the other hand, DSA reduces the flexibility of information sharing, since
each time a client (party) wants to try a different set of support and confidence
levels, it has to request the rules from the server.

7 Conclusions 

In this paper, we have introduced a novel framework for protecting sensitive 
knowledge before sharing association rules. 

Our contributions in this paper can be summarized as follows. First, we presented a sanitizing
algorithm called the Downright Sanitizing Algorithm (DSA). DSA blocks some
inference channels to ensure that an adversary cannot reconstruct restrictive
rules from the non-restrictive ones. In addition, DSA drastically reduces the side
effect factor during the sanitization process. Our experiments demonstrated that
DSA is a promising approach for protecting sensitive knowledge before sharing
association rules. Second, the framework also encompasses metrics to evaluate
the effectiveness of the rule sanitization process. Another contribution is a taxonomy
of existing sanitizing algorithms. We also introduced a taxonomy of attacks
against sensitive knowledge.

The work presented here introduces the notion of rule sanitization, which
complements the idea behind data sanitization. While data sanitization relies on
protecting sensitive knowledge before sharing data, rule sanitization is concerned
with the sharing of patterns. Currently, we are investigating the existence of new
types of attacks against sanitized databases, and the effective response to such
attacks. Another interesting issue to address is the problem of hiding rules in
collective data. In our previous motivating example, if all clients share their rules
in the server but want to hide some global rules, i.e., rules that become confident
with the collective support, our algorithm seems vulnerable in such a context, and
a collaborative sanitization should be explored.

Abstract. Although the task of mining association rules has received
considerable attention in the literature, algorithms to find time
association rules are often inadequate, by either missing rules when
the time interval is arbitrarily partitioned in equal intervals or by
clustering the data before the search for high-support itemsets is
undertaken. We present an efficient solution to this problem that uses
the fractal dimension as an indicator of when the interval needs to be
partitioned. The partitions are done with respect to every itemset in
consideration, and therefore the algorithm is in a better position to find
frequent itemsets that would have been missed otherwise. We present
experimental evidence of the efficiency of our algorithm both in terms
of rules that would have been missed by other techniques and also in
terms of its scalability with respect to the number of transactions and
the number of items in the data set.

Keywords: Temporal association rules, fractal dimension, intrusion detection



1 Introduction 

Association rules have received a lot of attention in the data mining community
since their introduction in [1]. Association rules are rules of the form $X \rightarrow Y$
where $X$ and $Y$ are sets of attribute-values, with $X \cap Y = \emptyset$ and $\|Y\| = 1$. The
set $X$ is called the antecedent of the rule while the item $Y$ is called the consequent.
For example, in market-basket data of supermarket transactions, one may find
that customers who buy milk also buy honey in the same transaction, generating
the rule milk $\rightarrow$ honey. There are two parameters associated with a rule: support
and confidence. The rule $X \rightarrow Y$ has support $s$ in the transaction set $T$ if
$s\%$ of transactions in $T$ contain $X \cup Y$. The rule $X \rightarrow Y$ has confidence $c$
if $c\%$ of transactions in $T$ that contain $X$ also contain $Y$. The most difficult
and dominating part of an association rules discovery algorithm is to find the
itemsets $X \cup Y$ that have strong support. (Once an itemset is deemed to have
strong support, it is an easy task to decide which item in the itemset can be the
consequent by using the confidence threshold.)
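
As a small illustration of the two parameters, the sketch below computes support and confidence as fractions rather than percentages. The baskets and the function names are hypothetical and serve only as an example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical market-basket data.
baskets = [frozenset(b) for b in (
    {"milk", "honey"},
    {"milk", "honey", "bread"},
    {"milk"},
    {"bread"},
)]
s = support({"milk", "honey"}, baskets)       # 2 of 4 baskets
c = confidence({"milk"}, {"honey"}, baskets)  # 2 of the 3 baskets with milk
```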

Algorithms for discovering these “classical” association rules are abundant [2,
3]. They work well for data sets drawn from a domain of values with no relative
meaning (categorical values). However, when applied to interval data, they yield
results that are not very intuitive. Algorithms to deal with interval data have
appeared in the literature. The first algorithm ever proposed (by Srikant and
Agrawal [8]) defined the notion of quantitative association rules as rules where
the predicates may be equality predicates (Attribute = $v$) or range predicates
($v_1$ < Attribute < $v_2$). The problem that arises then is how to limit the number
of ranges (or intervals) that must be considered: if intervals are too large, they
may hide rules inside portions of the interval, and if intervals are too small they
may not get enough support to be considered.

Srikant and Agrawal deal with the problem by doing an Equi-depth initial
partitioning of the interval data. In other words, for a depth $d$, the first $d$ values
(in order) of the attribute are placed in one interval, the next $d$ in a second
interval, and so on. No other considerations, such as the density of the interval
or the distance between the values that fall in the same interval, are taken into account.
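
A minimal sketch of Equi-depth partitioning, assuming time values encoded as minutes since midnight; the function name and the sample values are ours and are not taken from [8].

```python
def equi_depth_intervals(values, depth):
    """Split the sorted values into consecutive groups of `depth` values.

    Returns one (low, high) pair per group; the last group may be smaller.
    """
    ordered = sorted(values)
    return [(ordered[i], ordered[min(i + depth, len(ordered)) - 1])
            for i in range(0, len(ordered), depth)]


# Hypothetical time stamps in minutes since midnight (540 = 9:00, 1260 = 21:00).
times = [540, 545, 550, 570, 600, 1260]
intervals = equi_depth_intervals(times, depth=3)
```

With these values the second interval runs from 9:30 to 21:00, which shows how an interval of fixed depth can end up covering very distant values.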

Miller and Yang in [7] point out the pitfalls of Equi-depth partitioning. In
short, intervals that include close data values (e.g., time values in an interval
[9:00, 9:30]) are more meaningful than those intervals involving distant values
(e.g., an interval [10:00, 21:00]). Notice that both types of intervals can coexist
in an Equi-depth partitioning schema, since the population of values can be
more dense in the closely tight intervals than in the loosely tight intervals. As
the authors in [7] point out, it is less likely that a rule involving a loosely tight
interval (such as [10:00, 21:00]) will be of interest than rules that involve the
closely tight intervals. Srikant and Agrawal [8] had pointed out this problem,
mentioning that their Equi-depth partitioning would not work well with skewed
data, since it would separate close values that exhibit the same behavior.

Miller and Yang proceed to establish a guiding principle that we include here 
for completeness: 

Goal 1. [7] In selecting intervals or groups of data to consider, we want a
measure of interval quality that reflects the distance between data points.

To achieve this goal, Miller and Yang defined a more general form of association
rules with predicates expressing that an attribute (or set of attributes)
ought to fall within a given subset of values. Since considering all possible subsets
is intractable, they use clustering [5] to find subsets that “make sense” by
containing a set of attributes that are sufficiently close. A cluster is defined in [7]
as a set of values whose “diameter” (average pairwise distance between values)
and whose “frequency” (number of values in the cluster) are both bounded by
pre-specified thresholds. They propose an algorithm that uses BIRCH [12] to
find clusters in the interval attribute(s) and use the clusters so found as “items”,
which, together with the items of categorical nature, are fed into the a-priori algorithm
[2] to find high-support itemsets and rules. So, an itemset found in this way
could take the form ([9:00, 9:30], bread), meaning that people frequently buy their bread
between 9:00 and 9:30 am. The interval [9:00, 9:30] would have been
found as constituting a cluster by feeding all the time values in the data set to
BIRCH.

We believe that Miller’s technique still falls short of a desirable goal. The
problem is that by clustering the interval attribute alone, without taking into
consideration the other (categorical) values in the data set, one can fail to discover
interesting rules. For instance, it is possible that while the itemset ([9:00,
9:30], bread) is a frequent one, the itemset ([9:00, 9:15], bread, honey) is also
frequent. Using clustering in the time attribute before considering the other
items in the data set may lead us to decide that the interval [9:00, 9:30] is to be
used as an item, and therefore the second itemset above will never be found as
frequent. Another problem is that invoking a clustering algorithm to cluster the
interval values along the itemset that we are considering (say, for instance, bread
and honey) would be prohibitively expensive.

In [11], the Apriori algorithm was extended to find association rules which satisfy
minimum support and confidence within a user-specified interval. In [10], the
authors proposed to use a density-based clustering algorithm to find interval
clusters first, and then generate temporal association rules with the Apriori algorithm.
This method requires at least two scans of a data set, which may still be too
expensive for large data sets. More seriously, a pre-clustering step assumes all
association rules share the same temporal clusters, which may cause problems
similar to those of Miller’s method as discussed above. Fortunately, we have devised a
method that can do the partitioning of the interval data in an on-line fashion,
i.e., while we are trying to compute the support of the non-interval items of the
itemset (bread and honey). This method uses the notion of fractal dimension to
produce a natural partitioning of the set of interval data. Although we conduct
our experiments only on temporal interval data, our method can be applied to
any data set with one or multiple interval attributes.

The rest of the paper is organized as follows. Section 2 briefly reviews the background
on fractal properties of data sets. Section 3 offers a motivating example
and our algorithm. In Section 4 we show the experimental results of applying our
technique to a real data set. Finally, Section 5 gives the conclusions and future
work.

2 Our Approach 

2.1 Motivation 

To motivate our technique, consider the data set shown in Figure 1, where timestamped
customer transactions (baskets) are shown.

Fig. 1. A timestamped set of baskets.

An Equi-depth algorithm such as the one presented in [8] may run the risk of
breaking the obvious temporal interval [9:00, 9:30] and miss the fact that the
item bread had high support for this entire period of time. Miller and Yang’s
algorithm will find that the attribute Time contains at least two clusters: points
in the interval [9:00, 9:30] and points in the interval [18:00, 18:40]. (The rest
of the time points may be grouped in other clusters or considered outliers.)
Using the first cluster and with support threshold equal to 0.16, Miller and
Yang’s algorithm would find, among others, the high-support itemsets ([9:00, 9:30],
bread, honey), ([9:00, 9:30], milk, cookies) and ([18:00, 18:40], beer, diapers,
bread, honey). The algorithm would have missed the itemset ([9:00, 9:15], bread,
honey) that has support 0.33, and the itemset ([9:00, 9:05], milk, cookies) with
support 0.16. The reason is that this algorithm would have fed the item [9:00,
9:30] to the a-priori algorithm, without further consideration for subintervals
within it, like the [9:00, 9:15] subinterval, which in this case is a better “fit”
for the items (bread, honey). (Or the [9:00, 9:05] interval for the items (milk,
cookies).)

As pointed out before, it would be extremely expensive to invoke clustering
for each itemset under consideration, because it would require too many scans of the data set. However,
using the fractal dimension we can design a simple, incremental way of deciding
when to break the interval data for any itemset under consideration. The central
idea is to modify a-priori in such a way that while the algorithm is measuring
the support for an itemset whose items are categorical (e.g., (bread, honey)), we
incrementally compute the fractal dimensions of all clusters (possibly multidimensional,
since the number of interval attributes in the clusters can be more than 1)
that include each of the interval attributes where the itemset shows up.

To illustrate, Figure 2 shows the configuration of this data set using the data of
Figure 1, while considering the support of the itemset (bread, honey). As each
of the rows in the data set of Figure 1 is considered, the rows of the cluster
shown in Figure 2 can be built and the fractal dimension of the cluster of rows
seen so far can be computed. (So far, we only have two clusters, [9:00, 9:15] and
[18:24, 18:39], each with only one interval attribute, and we are about to process
point 18:40.) Figure 3 shows the two box-counting plots of the data set of Figure 2.
The first plot includes all the points up to time 9:15 (the first cluster), while the
second includes the first cluster plus point 18:40 currently under consideration.
The first straight region shows a slope of -0.2 for a fractal dimension of 0.2. The
addition of the 18:40 point decreases the fractal dimension to 0.16. Similarly,
we compute the fractal dimension of the second cluster [18:24, 18:39] and the fractal
dimension of [18:24, 18:39] plus point 18:40, and the difference is smaller than
0.01. So point 18:40 is added to the second cluster, and the new second cluster will
be [18:24, 18:40]. When we input the next transaction that contains the itemset (bread,
honey), with time 20:40, it causes big changes in the fractal dimension of
both existing clusters, which indicates that we should start a new cluster with
the new point 20:40. A big change of fractal dimension (suppose we set the change
threshold as 0.01) is a good indication of a breaking point in the interval data,
and a good place to compute the support of the itemset composed of the interval
considered so far ([18:24, 18:40]) and the other two items (bread, honey). If the
support equals or exceeds the threshold, then this itemset is reported by
the algorithm.

In summary, if the minimum change of fractal dimension over all
existing clusters is less than a preset threshold, we put the new point
into the cluster whose fractal dimension changes the least; otherwise, we
start a new cluster. For the itemset (milk, cookies) there is a change
after the point 9:07. (The itemset receives its strong support before that time.)
And for the itemset (beer, diapers), there is a clear change after the point 18:00.
Indeed, after that point a new interval opens in which the itemset receives strong
support. Thus, a potential high-support itemset would be ([18:00, 18:40], beer,
diapers). (This, in fact, would reveal an interesting nugget: that this pair of items
is bought frequently after work hours, for example.) Again, we could not have
found these itemsets (([9:00, 9:05], milk, cookies), ([9:00, 9:15], bread, honey)
and ([18:00, 18:40], beer, diapers)) by preclustering the time attribute.

With our method, we do not require any order in the interval attributes, and it can
be readily applied to a data set with multiple interval attributes.

Fig. 2. The data set resulting from selecting the values of the interval attribute (Time) for
which the itemset (bread, honey) is present in the transaction.
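
The assignment rule described above can be sketched for a single interval attribute. The sketch estimates the fractal dimension as the least-squares slope of a box-counting plot; the choice of scales, the threshold of 0.05, the one-point-per-minute clusters, and the function names are assumptions made for illustration. The estimates depend on the scales and on how the boxes align with the data, so the numbers do not match those of Figure 3.

```python
import math


def box_counting_dimension(points, scales=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Least-squares slope of log N(r) against log(1/r) for 1-D points,
    where N(r) is the number of boxes of width r that contain a point."""
    xs, ys = [], []
    for r in scales:
        occupied = {math.floor(p / r) for p in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def assign_point(clusters, point, threshold):
    """Add `point` to the cluster whose dimension changes least, or start a
    new cluster when even the smallest change reaches `threshold`."""
    best_index, best_change = None, math.inf
    for i, cluster in enumerate(clusters):
        change = abs(box_counting_dimension(cluster + [point])
                     - box_counting_dimension(cluster))
        if change < best_change:
            best_index, best_change = i, change
    if best_index is not None and best_change < threshold:
        clusters[best_index].append(point)
    else:
        clusters.append([point])
    return clusters


# Times in minutes since midnight; one point per minute is a simplification.
clusters = [list(range(540, 556)),    # 9:00 to 9:15
            list(range(1104, 1120))]  # 18:24 to 18:39
assign_point(clusters, 1120, threshold=0.05)  # 18:40
assign_point(clusters, 1240, threshold=0.05)  # 20:40
```

In this run the point 18:40 joins the second cluster, while 20:40 changes both estimates by more than the threshold and therefore opens a third cluster.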



2.2 Algorithm 

The pseudo-code for our algorithm is shown in Figure 4. Line 1 selects all
the 1-itemsets, as in a-priori, but considering only the categorical (non-interval)
items. (Every $L_k$ will contain only categorical itemsets.) Then the algorithm
iterates until no more candidates exist (Line 2), generating the next set of
candidates from $L_{k-1}$ (as in a-priori), and for every candidate $I$ in that set
initializing the multidimensional set $M_I$ as empty and making the lower bound of
the interval for $I$, $low_I$, equal to $-\infty$. (This will be changed as soon as
the first occurrence of $I$ is spotted in the data set, in Lines 14-15.) For each
transaction (basket) $t$ in $D$, if the transaction contains $I$, a new row is added
to the set $M_I$ containing the interval attribute values for $t$ in Line 13. (The
notation $interval(t)$ indicates extracting the interval data from a tuple $t$.)
Regardless of whether the itemset is in $t$ or not, the overall count for the itemset
interval, $M_I.totalcount$, is increased in Line 10. This count will serve as the
denominator when the support of $I$ within the interval needs to be computed. Both
the overall count ($I.count$) and the interval count ($M_I.count$) for the itemset
are increased (Line 12). The fractal dimension of the set $M_I$ is then computed
(Line 16), and if a significant change is found, a new interval is declared
(Line 17). The notation $last(M_I, interval(t))$ indicates the interval data point
before the current $interval(t)$ in $M_I$. Line 19 tests if the itemset composed by
this interval and $I$ has enough support. Notice that both $M_I.count$ and
$M_I.totalcount$ need to be reduced by one, since the current transaction is not
part of the interval being tested. If the support is greater than or equal to
minsup, the itemset is placed in the output (Line 19). Line 20 re-initializes the
counts. Line 22 is the classic test for the overall support in a-priori, with $N$
being the total number of transactions in the data set.

Fig. 3. Log-log plot of the data set shown in Figure 2 up to the points 9:15 and 18:40
respectively. The slope of the 9:15 curve is -0.2, indicating a fractal dimension of 0.2,
while the slope of the 18:40 curve is -0.16, for a fractal dimension of 0.16.
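
As an illustration of the quantity the algorithm tracks, the following is a minimal sketch of a box-counting estimate of the fractal dimension of one-dimensional interval values (for example, times scaled to [0, 1)). It is an assumption that a box-counting estimator is adequate here; the estimator actually used by the authors may differ. The names `box_counting_dimension` and `fd_change` are hypothetical. The estimate is the negated slope of log(occupied boxes) against log(box size), matching the way a slope of -0.2 is read as a dimension of 0.2 in Fig. 3.

```python
import math

def box_counting_dimension(points, box_sizes):
    """Estimate the box-counting dimension of 1-D points.

    For each box size r, count the boxes of width r that contain at least
    one point, then fit a least-squares line to log(count) versus log(r).
    The estimate is the negated slope of that line. At least two distinct
    box sizes are required.
    """
    xs, ys = [], []
    for r in box_sizes:
        occupied = {math.floor(p / r) for p in points}
        xs.append(math.log(r))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

def fd_change(window, new_point, box_sizes):
    """Absolute change in the estimate when new_point joins the window.

    A change above a user threshold would be the signal to close the
    current interval and start a new one (assumed reading of the text).
    """
    before = box_counting_dimension(window, box_sizes)
    after = box_counting_dimension(window + [new_point], box_sizes)
    return abs(after - before)
```

Points spread evenly over the range give an estimate close to 1, while points piled on a single value give 0; a burst of activity inside a quiet period shifts the estimate between these extremes.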

3 Experiments 

The experiments reported in this section were conducted on a Sun Workstation
Ultra2 with 500 MB of RAM, running Solaris 2.5. The experiment was done
over a real data set of sniffed network connections, collected over MITRE's
network for intrusion detection purposes. The attributes in the collected data
are: Time (one field containing both date and time), source IP address,
destination IP address, duration of the connection, type of the connection
(TELNET, Login, etc.), and name of the sensor that reported the connection.
All IP addresses and sensor names in the data are changed to symbolic names
such as sensor1 or the IP address 111.11.1.1.

(1)  L_1 = {large 1-itemsets, from categorical values};
(2)  for (k = 2; L_{k-1} != ∅; k++) do
(3)    C_k = generate(L_{k-1})
(4)    for every I in C_k
(5)      make M_I = ∅ for every I in C_k
(6)      make low_I = -∞
(7)    while there are transactions in D do
(8)      consider next transaction t in D
(9)      for every candidate I in C_k
(10)       M_Ic.totalcount++
(11)       if I is in t
(12)         make M_I.minfdchange = +∞
(13)         for every interval cluster c in I in C_k
(14)           I.count++, M_Ic.count++
(15)           add row with interval(t) to M_Ic
(16)           if low_I = -∞
(17)             low_I = interval(t)
(18)           compute F(M_Ic) = fd(M_Ic)
(19)           if the fractal dimension change is less than M_I.minfdchange
(20)             assign this smaller change to M_I.minfdchange
(21)             assign the number of the cluster with the smallest change to j
(22)         if M_I.minfdchange is larger than the threshold
(24)           output itemset [low_I, last(M_I, interval(t))].I
(25)           M_Ij = interval(t), low_I = interval(t),
               M_I.count = M_I.totalcount = 1
(26)   end
(27)   L_k = {I in C_k | I.count / N ≥ minsup}
(28)   output items in L_k
(29) end



Fig. 4. Our algorithm. 



A total of 4000 data records were used in this experiment. These records cover 
about 40 hours of network events (telnet, login, etc.). The Time attribute is the 
quantitative or interval attribute. 

We tried the Equi-depth approach [8] and our algorithm on this data. For our
algorithm, the parameters needed from the user are minimum support, minimum
window size, and a threshold for the fractal dimension. The parameters needed
for the Equi-depth approach are minimum support, minimum confidence, and
maximum support. We used the same minimum support for both algorithms, so
this parameter has equal impact on both approaches, which is desirable
for the comparison of the rules generated by each. For the Equi-depth approach,
we used a minimum confidence of 50%, and after trying different values of
maximum support we picked the one that gave the best results, namely 12%. Using
the maximum support in the equations suggested in Srikant and Agrawal's paper
[8], the number of intervals is 8. Dividing the 24 hours by 8, our intervals are
0:00-2:59, 3:00-5:59, 6:00-8:59, and so on.
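
The interval boundaries quoted above can be reproduced with a few lines. This is only a sketch of the final division step (24 hours split into 8 equal slots); it does not implement the equations of [8] that yield the number of intervals, and the function names are hypothetical.

```python
def fixed_hour_intervals(num_intervals):
    """Split a 24-hour day into equal-width intervals labelled H:MM-H:MM."""
    if num_intervals < 1 or 24 % num_intervals != 0:
        raise ValueError("this sketch assumes the interval count divides 24")
    width = 24 // num_intervals
    return [f"{start}:00-{start + width - 1}:59" for start in range(0, 24, width)]

def interval_of(hour, num_intervals):
    """Return the label of the interval that contains the given hour (0-23)."""
    width = 24 // num_intervals
    return fixed_hour_intervals(num_intervals)[hour // width]
```

With 8 intervals every slot is 3 hours wide, which is why the Equi-depth rules in Fig. 5 use windows such as [09 - 12] and [12 - 15].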

Rule Number   Rule                                           Confidence

1.            Time = 7/25 - [09 - 12] -> event = TELNET      91%
2.            Time = 7/25 - [12 - 15] -> event = TELNET      90%
3.            Time = 7/25 - [15 - 18] -> event = TELNET      93%
4.            Time = 7/26 - [09 - 12] -> event = TELNET      90%
5.            Time = 7/26 - [12 - 15] -> event = TELNET      90%
6.            dstIP = 222.22.2.2 -> srcIP = 111.11.1.1       100%
7.            srcIP = 111.11.1.1 -> dstIP = 222.22.2.2       100%


Fig. 5. Results of the Equi-depth approach with minimum support = 10%, minimum
confidence = 50%, and number of intervals = 8 (every 3 hours). Only the first 7
rules are shown.



Rule Number   Rule (window support, overall support)

1.            event=TELNET
              [7/25-00:00 - 7/25-09:06] (89%, 11%), [7/25-09:06 - 7/25-11:13] (90%, 11%)
              [7/25-11:13 - 7/25-16:04] (91%, 22%), [7/25-16:04 - 7/26-00:01] (93%, 11%)
              [7/26-00:01 - 7/26-12:17] (91%, 22%), [7/26-12:17 - 7/26-16:22] (90%, 11%)

              Rules 2 through 15 are omitted due to page limit.

16.           duration=0
              [7/25-00:00 - 7/25-09:06] (19%, 2%), [7/25-09:06 - 7/25-11:13] (14%, 1%)
              [7/25-11:13 - 7/25-16:04] (12%, 3%), [7/25-16:04 - 7/26-00:01] (12%, 1%)
              [7/26-00:01 - 7/26-12:17] (13%, 3%), [7/26-12:17 - 7/26-16:22] (11%, 1%)

              Rules 17 through 21 are omitted due to page limit.



Fig. 6. Results of our algorithm with minimum support = 10.00%, minimum window 
size = 500, FD threshold = 0.02. 



For our algorithm, we tried different threshold values. The threshold specifies
the amount of change that should be seen in the fractal dimension before the
algorithm draws a line and starts a new window. The bigger this number is,
the fewer rules are generated. The minimum window size forces the
window to have at least the specified minimum number of records regardless
of the change in fractal dimension. A bigger window size reduces the execution
time but might miss some rules.

The results generated by each approach in our experiment are shown in
Figures 5 and 6 and compared below. We show the output rules in groups so
they can be matched and compared easily. For example, rules 1 through 5 in
the Equi-depth output are comparable to rule 1 in the Fractals output. To read
the rules, Equi-depth rule 1 shows that 91% of network connections on July
25 between 9 am and 12 pm are TELNET connections. Fractals rule 1 shows that 89%
of connections from 12 am (midnight) to 9:06:43 am on July 25 are TELNET
connections; it also shows that this is 11% of all connections in the data set.
Note that Fractal rules are all associated with the interval attribute (Time),
while the Equi-depth rules show such associations only if they meet the minimum
support and minimum confidence. As a result, the Equi-depth rules 6 and 7 show
associations between source IP = 111.11.1.1 and destination IP = 222.22.2.2, but
their associations with the interval attribute, as shown by Fractals rule 2, are
missed. An itemset's association with the interval attribute, Time in this case,
could be an important one. For example, the fact that 26% of connections after
business hours from 16:04 to midnight (as shown by the third line under Fractals
rule 2) occurred between source IP address 111.11.1.1 and destination IP
address 222.22.2.2 is an interesting rule which is worth further investigation
(since normally one expects fewer connections after business hours).
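
The two percentages attached to each Fractals rule can be made concrete with a small sketch. It assumes, following the reading given above, that window support is the share of records inside the window that match the itemset, and overall support is the share of all records that both fall in the window and match. The record layout and function name are hypothetical.

```python
def window_and_overall_support(records, start, end, predicate):
    """Return (window_support, overall_support) for one time window.

    records   : list of (timestamp, attributes) pairs
    start, end: window bounds, start inclusive and end exclusive
    predicate : function deciding whether a record's attributes match
    """
    in_window = [attrs for ts, attrs in records if start <= ts < end]
    matches = sum(1 for attrs in in_window if predicate(attrs))
    window_support = matches / len(in_window) if in_window else 0.0
    overall_support = matches / len(records) if records else 0.0
    return window_support, overall_support
```

A rule such as duration=0 can therefore have a noticeable window support while its overall support stays far below the minimum support used for whole-data-set rules.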

Fractals rules 14 through 21 are associations found by the Fractals approach
and not found by the Equi-depth approach. Some of these rules could be important.
For example, Fractals rule 16 shows connections with a duration of 0 seconds.
A connection with very short duration is an important characteristic
in intrusion detection. The first line under rule 16 shows that 19% of the
connections between midnight and 9 in the morning (on July 25) occurred with a
duration of 0. Again, this is a rule that is worth further investigation. Note that
this association, [7/25-00:00:00 - 7/25-09:06:43] -> duration = 0,
has an overall support of 2% and would never make it through the filter of 10%
minimum support.

4 Conclusions 

We have presented here a new algorithm to efficiently find association rules for 
data sets that contain one dimension of interval values. Previous algorithms dealt 
with this problem by dividing the interval data in Equi-depth intervals, risking 
missing some important rules, or by clustering the interval data before using 
the classical a-priori approach, and using the intervals found by the clustering 
algorithm as items. In doing so, these intervals remain fixed for the rest of
the algorithm, regardless of whether for certain itemsets it would have been
more appropriate to consider a subinterval. We have successfully implemented the
code shown in Figure 4 and conducted experiments on synthetic and real data
sets with it. Our results show that this technique can efficiently find frequent
itemsets that were missed by previous approaches. Also, our results demonstrate 
that we are able to find important associations that could be missed by previous 
algorithms for interval data. 

Abstract. Constraint based mining finds all itemsets that satisfy a set
of predicates. Many constraints can be categorised as being either monotone
or antimonotone. DualMiner was the first algorithm that could
utilise both classes of constraint simultaneously to prune the search
space. In this paper, we present two parallel versions of DualMiner. The
ParaDualMiner with Simultaneous Pruning efficiently distributes the
task of expensive predicate checking among processors with minimum
communication overhead. The ParaDualMiner with Random Polling
makes further improvements by employing a dynamic subalgebra partitioning
scheme and a better communication mechanism. Our experimental
results indicate that both algorithms exhibit excellent scalability.



1 Introduction 

Data mining in the presence of constraints is an important problem. It can
provide answers to questions such as “find all sets of grocery items that occur more
than 100 times in the transaction database and the maximum price of the items
in each of those sets is greater than 10 dollars”. To state our problem formally,
we denote an item as $i$. A group of items is called an itemset, denoted as $S$. The
list of items that can exist in our database is denoted as $I = \{i_1, i_2, \ldots, i_n\}$,
$S \subseteq I$. The constraints are a set of predicates $\{P_1, P_2, \ldots, P_n\}$ that have to be
satisfied by an itemset $S$. Constraint based mining finds all sets in the powerset
of $I$ that satisfy $P_1 \wedge P_2 \wedge \ldots \wedge P_n$.
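
To make the problem statement concrete, here is a brute-force sketch that enumerates the powerset of the items and keeps every non-empty itemset satisfying all predicates. It is exponential in the number of items and is meant only as a reference definition, not as the DualMiner algorithm. The helper names and the toy grocery data in the usage note are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def mine_with_constraints(items, predicates):
    """Return every non-empty itemset S of items with P_1(S) and ... and P_n(S)."""
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            s = frozenset(combo)
            if all(p(s) for p in predicates):
                result.append(s)
    return result
```

For instance, with transactions {milk, bread}, {milk, wine} and {milk, wine, bread}, prices milk = 2, bread = 3, wine = 15, and the predicates "support greater than 1" and "maximum price greater than 10", the sketch returns {wine} and {milk, wine}.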

Many constraints can be categorised as either monotone or antimonotone
constraints [8,10]. There exist many algorithms that can only use one
of them to prune the search space. There are also algorithms that can mine
itemsets using these two constraint categories in a sequential fashion (e.g. [13]).
DualMiner was the first algorithm able to interleave both classes of constraint
simultaneously during mining [4]. Nevertheless, the task of finding itemsets that
satisfy both classes of constraint is still time consuming. High performance
computation offers a potential solution to this problem, provided that an efficient
parallel version of DualMiner can be constructed.

Contributions: In this paper, we introduce two new parallel algorithms which
extend the original serial DualMiner algorithm [4]. We have implemented our
algorithms on a Compaq Alpha Server SC machine and they both show excellent
scalability. To the best of our knowledge, our algorithms are the first parallel
algorithms that can perform constraint-based mining using both monotone and
antimonotone constraints simultaneously.

2 Preliminaries 

We now provide a little background on the original DualMiner algorithm [4].
Formally, given itemsets $M$, $J$ and $S$, a constraint is antimonotone if

$\forall S, J : ((J \subseteq S \subseteq M) \wedge P(S)) \Rightarrow P(J)$

One of the most widely cited antimonotone constraints is $support(S) > c$. A
constraint is monotone if

$\forall S, J : ((S \subseteq J \subseteq M) \wedge Q(S)) \Rightarrow Q(J)$

Monotone constraints are the opposite of antimonotone constraints. Therefore, a 
corresponding example of a monotone constraint is support(S) < d. A conjunc- 
tion of antimonotone predicates is antimonotone and a conjunction of monotone 
predicates is monotone [4]. Therefore, many itemset mining problems involving 
multiple constraints can be reduced to looking for all itemsets that satisfy a 
predicate of the form P{S) A Q{S). Even though some constraints are not ex- 
actly monotone or antimonotone constraints, previous research indicates that 
they can be approximated to be either monotone or antimonotone if some as- 
sumptions are made [10]. According to previous work, the search space of all 
itemsets forms a lattice. Given a set $I$ with $n$ items, the number of elements in
the lattice is $2^n$. This is equal to the number of elements in the powerset of $I$,
which is denoted as $2^I$. By convention, the biggest itemset is at the bottom of
this lattice and the smallest itemset will always be at the top. Besides that, our
search space also forms a Boolean algebra with maximal element $B$ and minimal
element $T$. It has the following properties: (i) $A \in \mathcal{P}$, (ii) $B = \bigcup_{A \in \mathcal{P}} A$, which is the
bottom element of $\mathcal{P}$, (iii) $T = \bigcap_{A \in \mathcal{P}} A$, which is the top element of $\mathcal{P}$, (iv) for any
$A \in \mathcal{P}$, $\overline{A} = B \setminus A$. A subalgebra is a collection of elements $\subseteq 2^I$ closed under $\bigcap$
and $\bigcup$. The top and bottom element of the algebra is sufficient to represent all
the itemsets in between them. If the top and bottom elements satisfy both con- 
straints, the monotone and antimonotone properties guarantee that all itemsets 
in between them will satisfy both constraints. A good subalgebra is a subalge- 
bra which has top and bottom elements that satisfy both the antimonotone and 
monotone constraints. 
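
The two definitions can be checked by brute force on a small item set. The sketch below tests a predicate against every pair of nested itemsets in the powerset; the function names are hypothetical and the check is only practical for a handful of items.

```python
from itertools import combinations

def powerset(items):
    """All subsets of items as frozensets; there are 2^n of them."""
    items = sorted(items)
    return [frozenset(c) for n in range(len(items) + 1)
            for c in combinations(items, n)]

def is_antimonotone(pred, items):
    """True if pred(S) implies pred(J) for every J that is a subset of S."""
    sets = powerset(items)
    return all(pred(j) for s in sets if pred(s) for j in sets if j <= s)

def is_monotone(pred, items):
    """True if pred(S) implies pred(J) for every J that is a superset of S."""
    sets = powerset(items)
    return all(pred(j) for s in sets if pred(s) for j in sets if s <= j)
```

On the transactions {A, B, C, D} and {A, B, C}, a minimum-support predicate passes the antimonotone check and fails the monotone one, while a maximum-support predicate behaves the other way round.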



Overview of DualMiner. DualMiner builds a dynamic binary tree when
searching for all good subalgebras. A tree node represents a subalgebra, but not
necessarily a good subalgebra. Each tree node $\tau$ consists of three item lists which
are (i) $IN(\tau)$ representing all the items that must be in the subalgebra and the
top element of current subalgebra, $T$, (ii) $CHILD(\tau)$ representing all the items
that have not been apportioned between $IN(\tau)$ and $OUT(\tau)$ and (iii) $OUT(\tau)$
representing all the items that cannot be contained in the current subalgebra.
Because our search space forms a Boolean algebra, $\overline{OUT}$ represents the bottom
element of the current subalgebra, $B$. Note that $\overline{OUT(\tau)} = IN(\tau) \cup CHILD(\tau)$.

When DualMiner starts, it will create a root node with $IN(\alpha)$ and $OUT(\alpha)$
empty. It will start checking the top element of the current subalgebra first.
If the top element does not satisfy the antimonotone constraint, no itemset
below it will satisfy the constraint either. Therefore, we can eliminate the
subalgebra. If it satisfies the antimonotone constraint, DualMiner will check all
the itemsets one level below $T$ in the subalgebra using the antimonotone constraint.
Each is of the form $IN \cup \{X\}$, where $X$ is an item from the CHILD item list. If
all itemsets $IN \cup \{X\}$ satisfy the constraint, no item list will be altered. If an
itemset does not satisfy the constraint, $X$ will be put into the OUT item list. This
effectively eliminates the region that contains that item.

Next, the algorithm will apply the monotone constraint on $B$ of the current
subalgebra. If the maximal itemset fails, the algorithm eliminates the current
subalgebra immediately. If it does not fail, DualMiner will start checking all the
itemsets one level above the current bottom itemset using the monotone constraint.
Each is of the form $\overline{OUT \cup \{X\}}$, where $X$ is an item from the CHILD
item list. If all itemsets $\overline{OUT \cup \{X\}}$ satisfy the constraint, no item list will be
altered. If an itemset does not satisfy the constraint, $X$ will be put into the IN
item list. This eliminates the region that does not contain that item.

The pruning process will continue until no pruning can be done. At the end 
of the pruning phase, the top itemset must satisfy the antimonotone constraint 
and the bottom itemset must satisfy the monotone constraint. If the top itemset 
also satisfies the monotone constraint and the bottom itemset also satisfies the 
antimonotone constraint, we have found one good subalgebra. If this is not the 
case, DualMiner will partition the subalgebra into two halves. This is done by 
firstly creating two child tree nodes and picking an item from the CHILD 
itemset and inserting it into the IN of one child and OUT of another child. The 
algorithm will mark the current parent node as visited and proceed to the child 
nodes. The process is repeated until all nodes are exhausted. 
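
The overview above can be condensed into a short serial sketch. It is a simplified reading of the description in this section, not the reference implementation of [4]: a node is kept as the pair (IN, CHILD) with OUT implicit, an explicit stack replaces the binary tree, and the item used for splitting is simply the smallest one in CHILD. `P` is the antimonotone predicate and `Q` the monotone one; all names are hypothetical.

```python
from itertools import combinations

def dualminer(items, P, Q):
    """Return the good subalgebras as (top, bottom) pairs of frozensets."""
    good = []
    stack = [(frozenset(), frozenset(items))]
    while stack:
        IN, CHILD = stack.pop()
        alive, changed = True, True
        while alive and changed:
            changed = False
            # Antimonotone pruning: the top element, then one level below it.
            if not P(IN):
                alive = False
                break
            for x in sorted(CHILD):
                if not P(IN | {x}):
                    CHILD = CHILD - {x}                # x moves to OUT
                    changed = True
            # Monotone pruning: the bottom element, then one level above it.
            bottom = IN | CHILD
            if not Q(bottom):
                alive = False
                break
            for x in sorted(CHILD):
                if not Q(bottom - {x}):
                    IN, CHILD = IN | {x}, CHILD - {x}  # x moves to IN
                    changed = True
        if not alive:
            continue                                   # subalgebra eliminated
        top, bottom = IN, IN | CHILD
        if Q(top) and P(bottom):
            good.append((top, bottom))                 # good subalgebra
        else:
            x = min(CHILD)                             # split into two halves
            stack.append((IN | {x}, CHILD - {x}))
            stack.append((IN, CHILD - {x}))
    return good

def expand(top, bottom):
    """List every itemset between the top and bottom of a subalgebra."""
    free = sorted(bottom - top)
    return [top | frozenset(c) for n in range(len(free) + 1)
            for c in combinations(free, n)]
```

On the transactions {A, B, C, D} and {C, D} with support(S) > 1 and support(S) < 3, the sketch returns a single good subalgebra with an empty top and bottom {C, D}. Its non-empty members are {C}, {D} and {C, D}; the empty itemset also lies in that subalgebra and is left out here only for readability.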



3 Parallel DualMiner 

DualMiner does not prescribe specific constraints that have to be used. The 
antimonotone constraint and the monotone constraint are two types of predicates 
over an itemset or oracle functions that return true or false. To simplify our 
implementation, we will use $support(S) > C$ as our antimonotone constraint
and $support(S) < D$ as our monotone constraint, with $C < D$. We represent the
database with a series of bit vectors. Each bit vector represents an item in the
database. The support count of an itemset can be found by performing a bitwise
AND operation on the bit vectors of the items in the itemset. This approach has
been used by many other algorithms [5,9]. We represent the IN, CHILD and
OUT itemlist in each node as three bit vectors. Each item is represented as 1 



ParaDualMiner: An Efficient Parallel Implementation 






bit in each of the three bit vectors. The position of the bit will indicate the item 
id of the item. 
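
A minimal sketch of the two bit-vector representations described above, using Python integers as arbitrarily long bit vectors. In the database vectors, bit t of an item's vector marks whether transaction t contains the item; in the item-list masks, bit k stands for the item with id k. The function names and the item-id mapping are hypothetical.

```python
def item_bitvectors(transactions, items):
    """One integer per item; bit t is set when transaction t contains the item."""
    return {
        item: sum(1 << t for t, trans in enumerate(transactions) if item in trans)
        for item in items
    }

def support(itemset, vectors, num_transactions):
    """Support count: AND the item vectors together and count the set bits."""
    acc = (1 << num_transactions) - 1   # empty itemset matches every transaction
    for item in itemset:
        acc &= vectors[item]
    return bin(acc).count("1")

def to_mask(itemset, item_ids):
    """Encode an item list (IN, CHILD or OUT) as a mask; bit position = item id."""
    return sum(1 << item_ids[item] for item in itemset)
```

With this encoding, moving an item between IN, CHILD and OUT is a matter of clearing one bit in one mask and setting the same bit in another.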




Fig. 1. ParaDualMiner with Simultaneous Pruning 



ParaDualMiner with Simultaneous Pruning. In the original DualMiner
paper, it was observed that the most expensive operation in any constraint based
mining algorithm is the oracle function. Therefore, we can achieve a great
performance gain if we can distribute the calls to the oracle function evenly among
different processors. We notice that after DualMiner verifies that the top element
of a subalgebra satisfies the antimonotone constraint, it will check whether
all the elements one level below the top element satisfy the antimonotone
constraint. Each of them is of the form $IN \cup \{X\}$, where $X$ is any item from the
CHILD itemset. Since each oracle function call is independent, it is possible to
partition the CHILD item list and perform the oracle function calls simultaneously.
For example, given the transaction database Transaction 1 = {A,B,C,D},
Transaction 2 = {A,B,C}, and the constraints $support(S) > 1$ and
$support(S) < 3$, the execution of the algorithm is illustrated in figure 1.
Before partitioning the CHILD item list, all the processors will have the same
IN, CHILD and OUT item lists. After the parallel algorithm distributes the
antimonotone constraint checking among different processors, any itemset of the
form $IN \cup \{X\}$, such as {D}, that does not satisfy the antimonotone constraint
will lead to an item being inserted into the OUT item list in order to prune
away that part of the search space. Therefore, at the end of the simultaneous
antimonotone checking, the item lists that will be altered are the CHILD and
OUT item lists.

Before proceeding, we must merge the search space that has not been pruned
away using the antimonotone constraint. The algorithm only has to perform a
bitwise boolean OR operation on the CHILD and OUT item lists of all processors.
This will give us the global result based on the individual processor pruning
process. This simplification is the direct result of us choosing the bit vector
as our item list representation. The merging operation can be done between all
processors using the MPI_Allreduce function, with boolean OR as the operator.
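
The merge step can be simulated in a single process. The sketch below makes one assumption that the text leaves implicit: after local pruning, each processor's CHILD mask holds only the surviving items of its own partition, so that a bitwise OR across processors rebuilds the global CHILD list instead of restoring pruned items. `functools.reduce` stands in for the all-reduce; no MPI call is made, and all names are hypothetical.

```python
from functools import reduce
from operator import or_

def local_antimonotone_prune(in_mask, my_child_mask, out_mask, num_items, P):
    """Check IN plus one item for every item in this processor's share of CHILD.

    Returns this processor's (CHILD, OUT) masks after pruning. P takes an
    itemset encoded as a bit mask and returns True or False.
    """
    child, out = my_child_mask, out_mask
    for item in range(num_items):
        bit = 1 << item
        if (my_child_mask & bit) and not P(in_mask | bit):
            child &= ~bit        # the item leaves CHILD ...
            out |= bit           # ... and is recorded in OUT
    return child, out

def merge(results):
    """Combine per-processor (CHILD, OUT) masks with bitwise OR."""
    children, outs = zip(*results)
    return reduce(or_, children), reduce(or_, outs)
```

For the example of figure 1, with two simulated processors taking {A, B} and {C, D}, the merged CHILD list is {A, B, C} and the merged OUT list is {D}.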

A similar process can be applied when we are pruning the search space using
the monotone constraint. The difference is that the algorithm is partitioning and
merging the IN and CHILD item lists instead of the OUT and CHILD item lists.
The partitioning operation is entirely a local operation. There is no message
passing involved. Each processor will check the number of items in the CHILD
item list, the number of processors and its own rank to decide which part of
the CHILD item list it will process. If the number of items can be divided evenly,
each processor will perform an equal number of oracle calls. However, this will
happen only if the number of items in the CHILD item list is a perfect multiple
of the number of processors. If the algorithm cannot divide the CHILD item
list evenly, the algorithm will distribute the residual evenly to achieve optimal
load balancing. Therefore, the maximum idle time for processors each time the
algorithm distributes the task of oracle function calls will be $T_{oracle}$.
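
One way to realise the local partitioning step is sketched below: each processor derives its own contiguous slice of the CHILD item list from the list length, the number of processors and its rank, with the residual items handed out one each to the lowest ranks. This particular slicing rule is an assumption; the text only states that the residual is distributed evenly. The function name is hypothetical.

```python
def my_share(child_items, num_procs, rank):
    """Return the slice of CHILD items that the processor with this rank checks.

    Every processor computes this locally, so no messages are exchanged.
    Share sizes differ by at most one item.
    """
    base, residual = divmod(len(child_items), num_procs)
    start = rank * base + min(rank, residual)
    size = base + (1 if rank < residual else 0)
    return child_items[start:start + size]
```

With 10 items and 4 processors the shares have sizes 3, 3, 2 and 2, so the two processors with the smaller shares wait for at most one oracle call.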

ParaDualMiner with Random Polling. There are a number of parallel
frequent pattern miners that use the concept of candidate partitioning and
migration to distribute tasks among processors (e.g. [7,1]). We can see similar
behaviour in DualMiner. Whenever DualMiner cannot perform any pruning on the
current subalgebra using both constraints, DualMiner will split the current
subalgebra into two halves by splitting the tree node. This node splitting operation
is essentially a divide-and-conquer strategy. No region of the search space has
been eliminated in the process. Therefore, the algorithm permits an arbitrary
amount of splitting of subalgebras, subject to the condition that they are to be
evaluated later. The number of splits permitted is equal to the number of items
in the CHILD item list. This intuition gives us the simplest form of a parallel
subalgebra partitioning algorithm.

In the 2 processor case, the original search space is partitioned into two
subalgebras. Both processors can turn off 1 bit in the CHILD bit vector. One puts
the item in the IN item list by turning on the corresponding bit in the IN bit
vector. The other processor will put it in the OUT item list by turning on the
corresponding bit. The two processors search two disjoint search spaces without any
need for communication. After the original algebra has been partitioned, each
processor will simultaneously run DualMiner locally to find itemsets that satisfy
both constraints. Since our search space can be partitioned up to the number of
items in the CHILD item list, this strategy can be applied to cases with more than
two processors. The number of processors that are needed must be $2^n$, where $n$ is
the number of times the splitting operation has been performed. The partitioning
operation is a local process. Each processor will only process one of the
subalgebras according to its own rank. There is no exchange of messages.
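
The static partitioning can be sketched as follows. It assumes that the first n items of the CHILD list are the ones split on and that bit k of a processor's rank decides whether the k-th of those items goes to IN or to OUT; the text does not fix either choice. The function name is hypothetical.

```python
def initial_subalgebra(child_items, num_procs, rank):
    """Return the (IN, CHILD, OUT) lists a processor starts from.

    num_procs must be a power of two, 2^n, with n no larger than the number
    of CHILD items. Each rank gets a different IN/OUT assignment of the first
    n items, so the 2^n subalgebras are disjoint and cover the search space.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("the number of processors must be a power of two")
    n = num_procs.bit_length() - 1
    if n > len(child_items):
        raise ValueError("more splits requested than items in CHILD")
    IN, OUT = [], []
    for level, item in enumerate(child_items[:n]):
        if (rank >> level) & 1:
            IN.append(item)      # this half must contain the item
        else:
            OUT.append(item)     # this half must not contain the item
    return IN, list(child_items[n:]), OUT
```

Because every processor computes its own starting node from its rank alone, the partitioning step needs no communication.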

This algorithm will only achieve perfect load balancing if the two nodes
contain equal amounts of work. This is unlikely because the search spaces
assigned to the processors are rarely of equal size. One of the processors may
terminate earlier than the rest of the processors. Without any dynamic load
balancing, that processor will remain idle throughout the rest of the execution
time. This leads to poor load balancing and longer execution time. To overcome
this problem, we can view a node as a job parcel. Instead of letting a free
processor stay idle throughout the execution time, the new algorithm can delegate
one of the nodes to an idle processor to obtain better load balancing.

There are two ways to initiate task transfer between processors. They are
sender-initiated and receiver-initiated methods [6]. Our study indicated that the
receiver-initiated scheme outperformed the sender-initiated scheme. The reason
for this is that the granularity of time when a processor is idle in the
sender-initiated scheme is large. DualMiner spends most of its time in the pruning
process. Therefore, the time a processor takes before it splits a node can be
very long. If a processor terminates very early at the start of its own pruning
process, it has to passively wait for another processor to split a node and
send it. This greatly decreases the work done per time unit, which leads to
poor speedup. Instead of being passive, the idle processor should poll for a job
from a busy processor. DualMiner can split a node anywhere, provided that
we do the splitting and job distribution properly. For example, given the
transaction database Transaction 1 = {A,B,C,D}, Transaction 2 = {C,D} and the
constraints $support(S) > 1$ and $support(S) < 3$, the set of itemsets that
satisfies both constraints is {{C}, {D}, {C, D}}.

The original search space will first be split into two subalgebras as shown
in figure 2. Since itemset {A} is infrequent, processor one, which processes the
left node, will finish earlier. Processor two, which processes the right node, could
still be within one of the pruning functions. Instead of staying idle, processor one
should then poll for a job from processor two.




Fig. 2. Subalgebra Partitioning with Random Polling 

Suppose processor two is pruning the search space using the antimonotone
constraint. This implies that the top element of the current subalgebra, such as
{}, has already satisfied the antimonotone constraint. Otherwise, this subalgebra
would have been eliminated. Therefore, while it is evaluating all the elements
one level below the top element, it can check for an incoming message from
processor one. Suppose it finds that processor one is free after evaluating itemset
{B}; it can then split the subalgebra and send it to processor one as shown in
figure 2. In this case, the subalgebra is further split into two smaller subalgebras
that can be distributed between these two processors. When the algorithm is
pruning the search space using the antimonotone constraint, the IN itemset must
have already satisfied the antimonotone constraint. Therefore, in this example,
if processor two continues pruning using the antimonotone constraint, processor
two should pick the right node. Likewise, if the algorithm is pruning the search
space using the monotone constraint, the sender processor should pick the left node.
Suppose that processor two has already split a node and there is already a node
or subalgebra that is yet to be processed. The algorithm should send that node
to the idle processor instead of splitting the current one that it is working on.
This is because the size of the subalgebra that is yet to be processed is equal to
or greater than the size of the current subalgebra, if we are using a depth first
or breadth first traversal strategy.

There are mainly two approaches to extend this algorithm to multiple
processors. They are categorised into decentralised and centralised schemes [15].
To simplify our implementation, we have adopted the master-slave scheme. The
master processor acts as an agent between all the slave processors. Each slave
processor will work on the subalgebras that are assigned to it.
However, it will anticipate an incoming message from the master. Whenever a
slave processor runs out of jobs locally, it will send a message to the master. The
master processor will then poll for a job from a busy processor. Therefore the
master has to know which processors are busy and which are idle.

For this purpose, we keep a list of processors that are busy and idle. The list
can be efficiently represented as a bit vector. The bit position in the vector will
then be the rank of the processor. A busy processor will be denoted as 1 in the
bit vector. A free processor will be denoted as 0 in the bit vector. Whenever the
master receives a message from processor X, it will set bit X to zero.
It will then select a processor to poll for a job. There are various ways to select
a processor. A random selection algorithm has been found to work very well in
many cases. Also, there is previous work that analyses the complexity of this
kind of random algorithm [11,12,11]. Therefore, we have used this algorithm
in our implementation. The master will randomly generate a number between 0
and $n - 1$, where $n$ is the number of processors. It will then send a free message
to the selected processor to poll for a job. If the selected slave processor does not
have any job, it will send a free message to the master processor. The master
processor will then mark it as free and put it into a free CPU queue. It will
continue polling until a job message is sent back to it. It will then send the job
to a free processor. The slave processors can detect incoming messages using
the MPI_Iprobe function. The termination condition is when the number of free
processors in the list is equal to the number of processors available. This implies
that there is no outstanding node to be processed. The master processor will
then send a termination message to all the slave processors.
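
The master's bookkeeping can be sketched without any message passing. The class below keeps the busy/free bit vector, picks a random rank to poll, and exposes the termination test. Two details are assumptions rather than statements from the text: ranks here index the slave processors only, and the random draw is repeated until it lands on a processor currently marked busy. All names are hypothetical.

```python
import random

class ProcessorTable:
    """Busy/free state of the slave processors, one bit per rank (1 = busy)."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.busy = (1 << num_procs) - 1      # every processor starts busy

    def mark_free(self, rank):
        self.busy &= ~(1 << rank)             # clear the processor's bit

    def mark_busy(self, rank):
        self.busy |= 1 << rank                # set the processor's bit

    def is_busy(self, rank):
        return bool((self.busy >> rank) & 1)

    def all_free(self):
        """Termination condition: no processor has outstanding work."""
        return self.busy == 0

    def pick_to_poll(self, rng=random):
        """Draw random ranks in [0, n - 1] until a busy one is found.

        Returns None when every processor is free.
        """
        if self.all_free():
            return None
        while True:
            rank = rng.randrange(self.num_procs)
            if self.is_busy(rank):
                return rank
```

In a real run the master would call `mark_free` when a slave reports that it has no work, poll the rank returned by `pick_to_poll`, and broadcast the termination message once `all_free` becomes true.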



4 Experiments 

We implemented both serial and parallel versions of DualMiner on a
128-processor Unix cluster. The specification of the machine is 1 Terabyte of shared
file storage, 128 Alpha EV68 processors at 833 MHz, a Quadrics interconnect
which has 200 Megabyte/sec bandwidth and 6 milliseconds latency, and 64 Gigabytes
of memory. We used databases generated from the IBM Quest Synthetic
data generator. The datasets generated from it are used in various papers [4,2,14].
The number of transactions is 10000. The dimension of the dataset is 100000,
which means the maximum number of distinct items in the transaction database is
100000. The length of a transaction is determined by a Poisson distribution with
a parameter, average length. The average length of transactions is 10. We also
scaled up the dimension of the dataset by doubling the average length of
transactions. This is because the computing resources demanded are significantly
higher if the dataset is dense [3]. For our purpose, we define a dataset with an
average transaction length of 10 to be sparse and one with an average length of 20
to be dense.

In the Random Polling version of ParaDualMiner, the master node only acts
as an agent between all the slave processors. It does not run DualMiner like
the slave processors. At the start of the algorithm, the original algebra is
partitioned into $2^n$ parts. This means this version of ParaDualMiner can only
accept $2^n$ processors for the slaves and one additional processor for the master.
Therefore, we studied our algorithm with 2, 3, 5, 9 and 17 processors.
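
The processor counts follow directly from the $2^n$ slaves plus one master rule, as this small sketch shows. The function names are hypothetical.

```python
def total_processors(n):
    """2^n slave processors plus one master."""
    return 2 ** n + 1

def is_valid_count(p):
    """True if p processors can be used: p - 1 must be a power of two."""
    slaves = p - 1
    return slaves >= 1 and slaves & (slaves - 1) == 0

counts = [total_processors(n) for n in range(5)]
```

Here `counts` covers n = 0 to 4, which reproduces the configurations used in the experiments.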




Fig. 3. Support between 0.015 and 0.03 percent and sparse dataset 



Fig. 4. Support between 0.015 and 0.03 percent and dense dataset



Results. As shown in figure 3 and figure 4, the execution time of ParaDualMiner
with Random Polling is almost identical to the serial version of DualMiner even
though it is using 2 processors. The reason is that the master in ParaDualMiner
with Random Polling does not perform any task besides acting as an agent between
all the slave processors. Therefore, in the 2 processor case, there is only one
processor working on the original algebra. When there are 3 processors, there will
be two slave processors that work on the subalgebras distributed to them. Since the
master will always poll for a job from a busy processor and it is always possible to
split work, the load balancing is excellent. Besides that, every communication is
point-to-point and the messages are relatively small. This leads to
almost perfectly linear speedup after 2 processors. We also observe that there is
super linear speedup in some cases. This is due to better memory usage. Whenever
a processor has relatively more nodes to process, the nodes will migrate to
other processors with less work load. This distributes the memory requirement
among all the processors.

From Figure 3 and Figure 4, we can see that ParaDualMiner with Simultaneous
Pruning is not as scalable as Random ParaDualMiner. The reason is that
whenever there are n processors, exactly n processors split the most
computationally intensive part of the algorithm, which is the oracle function
call. However, the algorithm only achieves perfect load balancing if the
number of items in the CHILD itemlist is an exact multiple of the number
of processors. As the number of processors increases, the chance of getting an
exact multiple of the number of processors decreases. This implies that the
chance of having some processors stay idle during an oracle function call becomes
larger. Many processors may therefore be idle too often, which impairs the
parallelism that can be achieved by this algorithm. Also, the algorithm only
parallelises the oracle function call. Furthermore, all-to-all communication is
needed to merge the results of pruning. This is much more expensive
than the point-to-point communication in ParaDualMiner with Random Polling.
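
The load-balancing argument above can be illustrated with a small calculation. The sketch below is not part of ParaDualMiner; it assumes, purely for illustration, that every oracle call on one CHILD item costs one unit of time and that the items are divided as evenly as possible among the processors. The function name `pruning_efficiency` is hypothetical.

```python
import math


def pruning_efficiency(num_items: int, num_procs: int) -> float:
    """Fraction of processor time spent on useful work in one pruning step.

    Assumption (not from the paper): each oracle call takes one time unit and
    the items are dealt out as evenly as possible, so the step lasts
    ceil(num_items / num_procs) units on every processor.
    """
    if num_items == 0:
        return 1.0
    step_length = math.ceil(num_items / num_procs)
    return num_items / (step_length * num_procs)


if __name__ == "__main__":
    # With 10 items, only processor counts that divide 10 keep everyone busy.
    for procs in (2, 4, 8, 16):
        print(procs, round(pruning_efficiency(10, procs), 3))
```

Under these assumptions the efficiency is 1 only when the number of items is an exact multiple of the number of processors, which becomes less likely as processors are added.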



5 Conclusion 

We have proposed two parallel algorithms for mining itemsets that must satisfy
a conjunction of antimonotone and monotone constraints. There are many
serial or parallel algorithms that take advantage of one of these constraints.
However, our two parallel algorithms are the first parallel algorithms that
take advantage of both constraints simultaneously to perform constraint-based
mining. Both algorithms demonstrate excellent scalability, which is
backed by our experimental results. We are currently investigating the scalability
of both algorithms using hundreds of processors. We are also investigating how
ParaDualMiner performs with other types of constraints. We believe that
both parallel algorithms should perform well, because there is no reliance on the
underlying nature of the constraints.



Abstract. Collaborative filtering (CF) has proved to be one of the most
effective information filtering techniques. However, because their computational
complexity grows quickly in both time and space as the number of records in the
user database increases, traditional centralized CF algorithms suffer from poor
scalability. In this paper, we first propose a novel distributed CF
algorithm called PipeCF through which we can do both the user database
management and prediction task in a decentralized way. We then propose two
novel approaches, significance refinement (SR) and unanimous amplification
(UA), to further improve the scalability and prediction accuracy of PipeCF.
Finally we give the algorithm framework and system architecture of the
implementation of PipeCF on a Peer-to-Peer (P2P) overlay network through the
distributed hash table (DHT) method, which is one of the most popular and
effective routing algorithms in P2P. The experimental data show that our
distributed CF algorithm has much better scalability than traditional centralized
ones with comparable prediction efficiency and accuracy.



1 Introduction 

Collaborative filtering (CF) has proved to be one of the most effective information
filtering techniques since Goldberg et al. [1] published the first account of using it for
information filtering. Unlike content-based filtering, the key idea of CF is that users
will prefer those items that people with similar interests prefer, or even that dissimilar
people don’t prefer. The k-Nearest Neighbor (KNN) method is a popular realization
of CF for its simplicity and reasonable performance. Many successful
applications have been built on it, such as GroupLens [4] and Ringo [5]. However,
because its computational complexity grows quickly in both time and space as the
number of records in the database increases, the KNN-based CF algorithm suffers
from poor scalability.

One way to avoid the recommendation-time computational complexity of a KNN
method is to use a model-based method that uses the users’ preferences to learn a
model, which is then used for predictions. Breese et al. utilize clustering and
Bayesian networks for a model-based CF algorithm in [3]. Their results show that the
clustering-based method is the more efficient but suffers from poor accuracy, while
the Bayesian networks prove practical only for environments in which knowledge of
user preferences changes slowly. Furthermore, all model-based CF algorithms still
require a central database to keep all the user data, which is sometimes not easy
to achieve, not only for technical reasons but also for privacy reasons.

An alternative way to address the computational complexity is to implement the KNN
algorithm in a distributed manner. As Peer-to-Peer (P2P) overlay networks gain more
and more popularity for their advantage in scalability, some researchers have already
begun to consider them as an alternative architecture [7,8,9] to centralized CF
recommender systems. These methods increase the scalability of CF recommender
systems dramatically. However, as they use a totally different mechanism to find
appropriate neighbors than KNN algorithms, their performance is hard to analyze and
may be affected by many other factors such as network conditions and the
self-organization scheme.

In this paper we solve the scalability problem of the KNN-based CF algorithm by
proposing a novel distributed CF algorithm called PipeCF, which has the following
advantages:

1. In PipeCF, both the user database management and prediction computation task 
can be done in a decentralized way which increases the algorithm’s scalability 
dramatically. 

2. PipeCF keeps all the other features of the traditional KNN CF algorithm, so the
system’s performance can be analyzed both empirically and theoretically, and
improvements to the traditional KNN algorithm can also be applied here.

3. Two novel approaches have been proposed in PipeCF to further improve the 
prediction and scalability of KNN CF algorithm and reduce the calculation 
complexity from to where M is the user number in the database and N is the items 
number. 

4. By designing a heuristic user database division strategy, the implementation of
PipeCF on a distributed-hash-table (DHT) based P2P overlay network is quite
straightforward and provides efficient user database management and retrieval
at the same time.

The rest of this paper is organized as follows. In Section 2, several related works
are presented and discussed. In Section 3, we give the architecture and key features of
PipeCF. Two techniques, SR and UA, are also proposed in this section. We then give
the implementation of PipeCF on a DHT-based P2P overlay network in Section 4. In
Section 5 the experimental results of our system are presented and analyzed. Finally
we make brief concluding remarks and outline future work in Section 6.



2 Related Works 

2.1 Basic KNN-Based CF Algorithm 

Generally, the task of CF is to predict the votes of active users from the user
database, which consists of a set of votes $v_{i,j}$ corresponding to the vote of user
i on item j. The KNN-based CF algorithm calculates this prediction as a weighted
average of other users’ votes on that item through the following formula:

$$p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$$

where $p_{a,j}$ denotes the prediction of the vote for active user a on item j and n is
the number of users in the user database. $\bar{v}_i$ is the mean vote for user i:

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$$

where $I_i$ is the set of items on which user i has voted. The weights $w(a,i)$
reflect the similarity between the active user and the users in the user database, and
$\kappa$ is a normalizing factor that makes the absolute values of the weights sum to unity.
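
The weighted-average prediction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard memory-based form (the active user's mean vote plus a normalized, weighted sum of the neighbors' deviations from their own means) and it assumes Pearson correlation over co-rated items as the similarity weight, which this section does not fix. All function names are hypothetical.

```python
from math import sqrt


def mean_vote(votes: dict) -> float:
    """Mean of one user's votes; `votes` maps item id -> rating."""
    return sum(votes.values()) / len(votes)


def pearson_weight(active: dict, other: dict) -> float:
    """Assumed similarity weight: Pearson correlation over co-rated items."""
    common = set(active) & set(other)
    if len(common) < 2:
        return 0.0
    mean_a, mean_i = mean_vote(active), mean_vote(other)
    num = sum((active[j] - mean_a) * (other[j] - mean_i) for j in common)
    den_a = sqrt(sum((active[j] - mean_a) ** 2 for j in common))
    den_i = sqrt(sum((other[j] - mean_i) ** 2 for j in common))
    if den_a == 0 or den_i == 0:
        return 0.0
    return num / (den_a * den_i)


def predict(active: dict, database: dict, item: str) -> float:
    """Predict the active user's vote on `item` from users who rated it."""
    weights = {
        user: pearson_weight(active, votes)
        for user, votes in database.items()
        if item in votes
    }
    norm = sum(abs(w) for w in weights.values())
    if norm == 0:
        # No usable neighbor: fall back to the active user's mean vote.
        return mean_vote(active)
    kappa = 1.0 / norm  # makes the absolute weights sum to one
    deviation = sum(
        w * (database[user][item] - mean_vote(database[user]))
        for user, w in weights.items()
    )
    return mean_vote(active) + kappa * deviation
```

With a single positively correlated neighbor, the prediction reduces to the active user's mean plus that neighbor's deviation from its own mean.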



2.2 P2P System and DHT Routing Algorithm 

The term “Peer-to-Peer” refers to a class of systems and applications that employ
distributed resources to perform a critical function in a decentralized manner. Some of
the benefits of a P2P approach include improving scalability by avoiding dependency
on centralized points and eliminating the need for costly infrastructure by enabling
direct communication among clients. As the main purpose of P2P systems is to share
resources among a group of computers called peers in a distributed way, efficient and
robust routing algorithms for locating wanted resources are critical to the performance
of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is
one of the most popular and effective, and it is supported by many P2P systems such as
CAN [10], Chord [11], Pastry [12], and Tapestry [13].

A DHT overlay network is composed of several DHT nodes and each node keeps a
set of resources (e.g., files, ratings of items). Each resource is associated with a key
(produced, for instance, by hashing the file name) and each node in the system is
responsible for storing a certain range of keys. Peers in the DHT overlay network
locate a wanted resource by issuing a lookup(key) request, which returns the identity
(e.g., the IP address) of the node that stores the resource with that key. The
primary goals of a DHT are to provide an efficient, scalable, and robust routing
algorithm that reduces the number of P2P hops involved in locating a resource,
and to reduce the amount of routing state that must be preserved at each peer.
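
The key-to-node mapping described above can be sketched with a single-process toy. This is an illustration only: it assumes a Chord-like rule in which a key is stored on the first node whose identifier is not smaller than the key (wrapping around the ring), and it replaces multi-hop routing by a direct search of a sorted list. The class `ToyDHT` and the helper `key_of` are hypothetical names, not the API of any real DHT.

```python
import bisect
import hashlib


def key_of(name: str) -> int:
    """Hash a name to a 32-bit integer key (the key size is an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")


class ToyDHT:
    """In-memory stand-in for a DHT: each node owns a range of keys."""

    def __init__(self, node_names):
        self.ring = sorted((key_of(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {name: {} for name in node_names}

    def lookup(self, key: int) -> str:
        """Return the name of the node responsible for `key`."""
        index = bisect.bisect_left(self.ids, key) % len(self.ring)
        return self.ring[index][1]

    def put(self, key: int, value) -> None:
        """Store `value` under `key` on the responsible node."""
        self.store[self.lookup(key)].setdefault(key, []).append(value)

    def get(self, key: int) -> list:
        """Fetch everything stored under `key`."""
        return self.store[self.lookup(key)].get(key, [])
```

A real DHT reaches the responsible node through several hops using partial routing tables; the toy only shows which node ends up holding a key.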



3 PipeCF: A Novel Distributed CF Algorithm

3.1 Basic PipeCF Algorithm 

The first step in implementing the CF algorithm in a distributed way is to divide the
original centralized user database into fractions which can then be stored on
distributed peers. For brevity, we will use the term bucket to denote such a
distributed, stored fraction of the user database in the rest of this paper. Here, two
critical problems should be considered. The first is how to assign each bucket a
unique identifier through which it can be efficiently located. The second is which
buckets should be retrieved when we need to make a prediction for a particular user.

Here, we solve the first problem by proposing a division strategy which makes
each bucket hold the records of the group of users who share a particular
<ITEM_ID, VOTE> tuple. This means that users in the same bucket have voted on at
least one item with the same rating. This <ITEM_ID, VOTE> tuple is then used to
generate a unique key that serves as the identifier of the bucket in the network,
which we describe in more detail in Section 4.

To solve the second problem, we propose a heuristic bucket choosing strategy that
retrieves only those buckets whose identifiers are the same as those generated from
the active user’s ratings. Figure 1 gives the framework of PipeCF. Details of the
function lookup(key) and the implementation of PipeCF on a DHT-based P2P overlay
network will be described in Section 4.

The bucket choosing strategy of PipeCF is based on the assumption that people
with similar interests will rate at least one item with similar votes. So when making
a prediction, PipeCF uses only the records of those users that are in the same bucket
as the active user. As we can see in Figure 5 of Section 5.3.1, this strategy has a very
high hit ratio. Moreover, through this strategy we reduce the calculation by about 50%
compared with the traditional CF algorithm and obtain comparable prediction
accuracy, as shown in Figures 6 and 7 in Section 5.



Algorithm: PipeCF
Input: rating record of the active user, target item
Output: predictive rating for target item
Method:
For each <ITEM_ID, VOTE> tuple in the rating record of the active user:
  1) Generate the key corresponding to the <ITEM_ID, VOTE> tuple
     through the hash algorithm used by DHT
  2) Find the host which holds the bucket with the identifier key
     through the function lookup(key) provided by DHT
  3) Copy all ratings in bucket key to the current host
Use the traditional KNN-based CF algorithm to calculate the predictive
rating for the target item.



Fig. 1. Framework of PipeCF 
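
The bucket division and bucket choosing steps of Fig. 1 can be sketched in memory as follows. This is a hedged illustration rather than the authors' code: a plain dictionary stands in for the DHT, SHA-1 is an assumed hash function, and the names `bucket_key`, `build_buckets` and `fetch_candidates` are hypothetical.

```python
import hashlib
from collections import defaultdict


def bucket_key(item_id: str, vote: int) -> str:
    """Key for an <ITEM_ID, VOTE> tuple (SHA-1 is an assumed choice)."""
    return hashlib.sha1(f"{item_id}:{vote}".encode()).hexdigest()


def build_buckets(database: dict) -> dict:
    """Place every user's record in one bucket per <ITEM_ID, VOTE> tuple."""
    buckets = defaultdict(dict)
    for user, votes in database.items():
        for item_id, vote in votes.items():
            buckets[bucket_key(item_id, vote)][user] = votes
    return buckets


def fetch_candidates(active_votes: dict, buckets: dict) -> dict:
    """Collect the records of users sharing at least one bucket with the active user."""
    candidates = {}
    for item_id, vote in active_votes.items():
        # In a deployment this would be a lookup(key) request to the DHT.
        candidates.update(buckets.get(bucket_key(item_id, vote), {}))
    return candidates


if __name__ == "__main__":
    database = {
        "u1": {"m1": 5, "m2": 3},
        "u2": {"m1": 4},
        "u3": {"m2": 3, "m3": 1},
    }
    active = {"m1": 5, "m3": 1}
    print(sorted(fetch_candidates(active, build_buckets(database))))
```

The returned candidate records would then be passed to a conventional KNN-based predictor; users who share no identical rating with the active user are never fetched.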



3.2 Some Improvements

3.2.1 Significance Refinement (SR) 

In the basic PipeCF algorithm, we return all users who share at least one
bucket with the active user, and we find that the number of fetched users is O(N),
where N is the total number of users, as Figure 7 shows. In fact, as Breese et al.
presented in [3] under the term inverse user frequency, universally liked items are not
as useful as less common items in capturing similarity. So we introduce a new concept,
significance refinement (SR), which reduces the number of users returned by the basic
PipeCF algorithm by limiting the number of returned users for each bucket. We term
the algorithm improved by SR Return K, which means “for every item, the PipeCF
algorithm returns no more than K users for each bucket”. The experimental results in
Figures 7 and 8 of Section 5.3.3 show that this method reduces the number of returned
users dramatically and also improves the prediction accuracy.
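
The Return K rule can be sketched as a per-bucket cap. The snippet is a minimal illustration under stated assumptions: a bucket is represented as a dictionary from user id to rating record, users are drawn uniformly at random (the experiments in Section 5.3.3 state that the users for each bucket are chosen randomly), and the function name `return_k` is hypothetical.

```python
import random


def return_k(bucket: dict, k: int, rng=None) -> dict:
    """Return at most k user records from one bucket, chosen at random.

    `bucket` maps user id -> rating record. `rng` is an optional
    random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()
    users = sorted(bucket)  # fixed order so a seeded rng is reproducible
    if len(users) <= k:
        return {user: bucket[user] for user in users}
    return {user: bucket[user] for user in rng.sample(users, k)}
```

Because every bucket contributes at most K users, the number of fetched users is bounded by K times the number of ratings of the active user, independent of the total number of users.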



3.2.2 Unanimous Amplification (UA) 

In our experiments with the KNN-based CF algorithm, we have found that some highly
correlated neighbors have few items on which they give the same rating as the
active user. These neighbors frequently prove to have worse prediction accuracy than
neighbors who have the same ratings as the active user but relatively lower
correlation. So we argue that we should specially reward the users who rated
some items with the same vote by amplifying their weights, which we term Unanimous
Amplification. We transform the estimated weights as follows:

where $N_{a,i}$ denotes the number of items on which user a and user i have the same votes.

Typical values in our experiments are 2.0 for $\alpha$, 4.0 for $\beta$, and 4 for $\gamma$.
The experimental result in Figure 9 of Section 5.3.4 shows that the UA approach improves
the prediction accuracy of the PipeCF algorithm.



4 Implementation of PipeCF on a DHT-Based P2P Overlay Network

4.1 System Architecture 

Figure 2 gives the system architecture of our implementation of PipeCF on the
DHT-based P2P overlay network. Here, we view the users’ ratings as resources and the
system generates a unique key for each particular <ITEM_ID, VOTE> tuple through
the hash algorithm, where ITEM_ID denotes the identity of the item the user votes on
and VOTE is the user’s rating of that item. As different users may vote on a particular
item with the same rating, each key corresponds to a set of users who have the same
<ITEM_ID, VOTE> tuple. As we stated in Section 3, we call such a set of users’ records
a bucket. As we can see in Figure 2, each peer in the distributed CF system is
responsible for storing one or several buckets. Peers are connected through a
DHT-based P2P overlay network and can find their wanted buckets by their keys
efficiently through the DHT-based routing algorithm.

As we can see from Figure 1 and Figure 2, the implementation of PipeCF on a
DHT-based P2P overlay network is quite straightforward except for two key pieces:
how to store the buckets and how to fetch them back effectively in this distributed
environment. We solve these problems through two fundamental DHT functions,
put(key) and lookup(key), which are described in Figure 3 and Figure 4 respectively.
These two functions inherit the following merits from DHT:

- Scalability: it must be designed to scale to several million nodes.

- Efficiency: similar users should be located reasonably quickly and with low overhead
in terms of the message traffic generated.

- Dynamicity: the system should be robust to frequent node arrivals and departures in
order to cope with the highly transient user populations characteristic of decentralized
environments.

- Balanced load: in keeping with the decentralized nature, the total resource load
(traffic, storage, etc.) should be roughly balanced across all the nodes in the system.




Fig. 2. System Architecture of Distributed CF Recommender System 



5 Experimental Results 

5.1 Data Set 

We use the EachMovie data set [6] to evaluate the performance of the improved
algorithm. The EachMovie data set is provided by the Compaq System Research Center,
which ran the EachMovie recommendation service for 18 months to experiment with a
collaborative filtering algorithm. The information gathered during that period
consists of 72,916 users, 1,628 movies, and 2,811,983 numeric ratings ranging from 0
to 5. To speed up our experiments, we only use a subset of the EachMovie data set.



5.2 Metrics and Methodology 

The metrics we use here for evaluating accuracy are statistical accuracy metrics,
which evaluate the accuracy of a predictor by comparing predicted values with
user-provided values. More specifically, we use the Mean Absolute Error (MAE), a
statistical accuracy metric, to report prediction experiments because it is the most
commonly used and is easy to understand:



$$\mathrm{MAE} = \frac{\sum_{(a,j) \in T} |v_{a,j} - p_{a,j}|}{|T|} \qquad (4)$$

where $v_{a,j}$ is the rating given to item j by user a, $p_{a,j}$ is the predicted value of user a
on item j, T is the test set, and |T| is the size of the test set.
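
As a small illustration of the metric, the sketch below computes the mean absolute error over a test set. The data layout (dictionaries keyed by hypothetical (user, item) pairs) and the function name are assumptions made for the example.

```python
def mean_absolute_error(actual: dict, predicted: dict) -> float:
    """Mean absolute error over the test set.

    Both arguments map (user, item) pairs to ratings; every key of
    `actual` must also be present in `predicted`.
    """
    if not actual:
        raise ValueError("the test set must not be empty")
    total = sum(abs(actual[key] - predicted[key]) for key in actual)
    return total / len(actual)


if __name__ == "__main__":
    actual = {("a", "m1"): 4, ("a", "m2"): 2}
    predicted = {("a", "m1"): 3.5, ("a", "m2"): 3}
    print(mean_absolute_error(actual, predicted))  # (0.5 + 1.0) / 2 = 0.75
```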



Algorithm: DHT-based CF puts a peer P’s vote vector to the DHT overlay network
Input: P’s vote vector
Output: NULL
Method:
For each <ITEM_ID, VOTE> in P’s vote vector:
  1) P generates a unique 128-bit DHT key (i.e., hashes the system-unique
     username).
  2) P hashes this <ITEM_ID, VOTE> tuple to key K, and routes it with P’s
     vote vector to the neighbor P_i whose local key K_i is the most similar
     to K.
  3) When P_i receives the PUT message with K, it caches it. If the most
     similar neighbor is not itself, it routes the message to its neighbor
     whose local key is most similar to K.

Fig. 3. DHT Put(key) Function

Algorithm: lookup(key)
Input: identifier key of the targeted bucket
Output: targeted bucket (retrieved from other peers)
Method:
  1) P routes the key K of the targeted bucket to the neighbor P_i whose local
     key K_i is the most similar to K.
  2) When P_i receives the LOOKUP message with K, if P_i has enough cached
     vote vectors with the same key K, it returns the vectors back to P;
     otherwise it routes the message to its neighbor whose local key is
     most similar to K. In either case, P finally gets all the records in the
     bucket whose identifier is key.

Fig. 4. DHT Lookup(key) Function

We select 2000 users and choose one user at a time as the active user and the
remaining users as his candidate neighbors, because every user makes only his own
recommendations locally. We use the mean prediction accuracy over all 2000 users
as the system’s prediction accuracy. For every user’s recommendation calculation, our
tests are performed using 80% of the user’s ratings for training, with the remainder for
testing.

5.3 Experimental Results

We design several experiments to evaluate our algorithm and analyze the effect of
various factors (e.g., SR and UA) by comparison. All our experiments are run on
a Windows 2000 based PC with an Intel Pentium 4 processor running at 1.8 GHz
and 512 MB of RAM.



5.3.1 The Efficiency of Neighbor Choosing 

We used a data set of 2000 users. Figure 5 shows how many of the users chosen by the
PipeCF algorithm are among the top-100 users of the traditional CF algorithm. We can
see from the data that when the number of users rises above 1000, more than 80 of the
100 users most similar to the active user are chosen by the PipeCF algorithm.




Fig. 5. How Many Users Chosen by PipeCF in Traditional CF’s Top 100
(axis label: Total Size of Train Set, # of users)




5.3.2 Performance Comparison 

We compare the prediction accuracy of the traditional CF algorithm and the PipeCF
algorithm while applying both top-all and top-100 user selection to them. The results
are shown in Figure 6. We can see that the DHT-based algorithm has better prediction
accuracy than the traditional CF algorithm.

5.3.3 The Effect of Significance Refinement 

We limit the number of returned users for each bucket to 2 and to 5 and repeat the
experiment of Section 5.3.2. The users for each bucket are chosen randomly. The
number of users chosen and the prediction accuracy are shown in Figure 7
and Figure 8 respectively. The results show:

- “Return All” has an O(N) returned user number and its prediction accuracy is also
not satisfactory;

- “Return 2” has the smallest returned user number but the worst prediction accuracy;

- “Return 5” has the best prediction accuracy and its scalability is still reasonably
good (the returned user number is still limited to a constant as the total user number
increases).

Fig. 7. The Effect on Scalability of SR on PipeCF

Fig. 8. The Effect on Prediction Accuracy of SR on PipeCF Algorithm

Fig. 9. The Effect on Prediction Accuracy of Unanimous Amplification
(axis label: Total Size of Train Set, # of users)



5.3.4 The Effect of Unanimous Amplification 

We adjust the weights for each user by using Equation (5) with $\alpha$ set to
2.0, $\beta$ to 4.0, and $\gamma$ to 4, and repeat the experiment of Section 5.3.2. We use
the top-100 and “Return All” selection methods. The results show that the UA approach
improves the prediction accuracy of both the traditional and the PipeCF algorithm.
From Figure 9 we can see that when the UA approach is applied, the two kinds of
algorithms have almost the same performance.



6 Conclusion 

In this paper, we solve the scalability problem of the KNN-based CF algorithm by
proposing a novel distributed CF algorithm called PipeCF and give its implementation
on a DHT-based P2P overlay network. Two novel approaches, significance
refinement (SR) and unanimous amplification (UA), have been proposed to improve
the performance of the distributed CF algorithm. The experimental data show that our
algorithm has much better scalability than the traditional KNN-based CF algorithm with
comparable prediction efficiency.



Abstract. We introduce the notion of dense regions as distinct and
meaningful patterns in given data. Efficient and effective algorithms
for identifying such regions are presented. Next, we discuss extensions of
the algorithms for handling data streams. Finally, experiments on large-scale
data streams such as clickstreams are given which confirm the
usefulness of our algorithms.



1 Introduction 

Besides the patterns identified by traditional data mining tasks such as clustering,
classification and association rule mining, we observe that dense regions,
which are two-dimensional regions defined by subsets of entities and attributes
whose corresponding values are mostly constant, are another type of pattern
that is significant and of practical use. For example, dense regions may
be used to evaluate the discriminability of a subset of attributes, so identifying
such regions enhances the feature selection process in classification applications.
Indeed, such patterns are very useful when analyzing users’ behavior and improving
web page or website topology design [2,3].

Our goals in this paper are to introduce the notion of dense regions and 
present an efficient and effective algorithm, called DRIFT (Dense Region Iden- 
tiFicaTion), for identifying such patterns. Extensions of the DRIFT algorithm 
for handling data streams are also discussed. Due to the lack of space, please 
refer to the extended version [5] for a thorough treatment on the subject. 

* The research of this author is partially supported by grants from NSF under contracts 
DMS-9973341 and ACI-0072112, ONR under contract N00014-02-1-0015 and NIH 
under contract P20 MH65166. 

** The research of this author is supported in part by Hong Kong Research Grants 
Council Grant Nos. HKU 7130/02P and HKU 7046/03P. 

^ A dense region usually refers to a subset of the space where data points are highly 
populated whereas dense regions discussed in this paper lie in the entity-by-attribute 
space. Our precise definition is given in Section 2. 

To the best of the authors’ knowledge, dense region patterns have not been
previously studied. A similar but not identical notion is the error-tolerant frequent
itemset introduced by Yang et al. in [4], which focuses on mining association rules.



2 Definition of Dense Regions 



Let X be a given n-by-p data matrix where n is the number of entities and p
is the number of attributes of each entity. Denote by X(R, C) the submatrix of
X defined by a subset of rows R and a subset of columns C. We also identify
X(R, C) by the set D = R × C.



Definition 1 (Dense regions (DRs)). A submatrix X(R, C) is called a maximal
dense region with respect to v, or simply a dense region with respect to v,
if X(R, C) is a constant matrix whose entries are v (density), and any proper
superset of X(R, C) is a non-constant matrix (maximality).



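Example 1. Let X be a data matrix given by the first matrix below. The
DRs of X with value 1 are given by the four matrices in the brace.

$$
\begin{pmatrix} 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 2 & 0 \end{pmatrix};
\quad
\left\{
\begin{pmatrix} 1 & * & * & * \\ 1 & * & * & * \\ 1 & * & * & * \\ 1 & * & * & * \end{pmatrix},
\begin{pmatrix} 1 & * & * & 1 \\ 1 & * & * & 1 \\ 1 & * & * & 1 \\ * & * & * & * \end{pmatrix},
\begin{pmatrix} * & * & * & * \\ 1 & 1 & * & 1 \\ 1 & 1 & * & 1 \\ * & * & * & * \end{pmatrix},
\begin{pmatrix} * & * & * & * \\ 1 & 1 & * & * \\ 1 & 1 & * & * \\ 1 & 1 & * & * \end{pmatrix}
\right\}
$$

Alternatively, we may denote the above DRs by {1, 2, 3, 4} × {1}, {1, 2, 3} × {1, 4},
{2, 3} × {1, 2, 4} and {2, 3, 4} × {1, 2} respectively.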



3 The DRIFT Algorithm 

The BasicDRIFT Algorithm. This starts from a given point (s, t) containing 
the target value v and returns two regions containing (s,t) where the first one 
is obtained by a vertical-first-search; the other is by a horizontal-first-search. It 
is proven in [5, Theorem 1] that the two returned regions are in fact DRs. 



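Algorithm: BasicDRIFT(X, s, t)

$R_v \leftarrow \{1 \le i \le n \mid X_{it} = X_{st}\}$, $C_v \leftarrow \{1 \le j \le p \mid X_{ij} = X_{it}\ \forall i \in R_v\}$
$C_h \leftarrow \{1 \le j \le p \mid X_{sj} = X_{st}\}$, $R_h \leftarrow \{1 \le i \le n \mid X_{ij} = X_{sj}\ \forall j \in C_h\}$
Return $\{R_v \times C_v, R_h \times C_h\}$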



To determine the time complexity, we suppose the two resulting DRs have
dimensions $n_v$-by-$p_v$ and $n_h$-by-$p_h$ respectively. The number of
computations required by the algorithm is $n + n_v p + p + p_h n$. Moreover, in practice,
$n_v, n_h \ll n$ and $p_v, p_h \ll p$. In this case, the complexity is essentially $O(n + p)$.
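
The two searches of BasicDRIFT can be sketched directly from the pseudocode. The snippet below is an illustrative transcription, not the authors' code; it uses 0-based row and column indices (the text uses 1-based indices), represents the data matrix as a list of lists, and is run on the matrix of Example 1.

```python
def basic_drift(X, s, t):
    """Return the vertical-first and horizontal-first regions through (s, t).

    Each region is a pair (rows, cols) of 0-based index lists.
    """
    n, p = len(X), len(X[0])
    v = X[s][t]
    # Vertical-first search: rows matching v in column t, then the
    # columns that equal v on all of those rows.
    rows_v = [i for i in range(n) if X[i][t] == v]
    cols_v = [j for j in range(p) if all(X[i][j] == v for i in rows_v)]
    # Horizontal-first search: columns matching v in row s, then the
    # rows that equal v on all of those columns.
    cols_h = [j for j in range(p) if X[s][j] == v]
    rows_h = [i for i in range(n) if all(X[i][j] == v for j in cols_h)]
    return (rows_v, cols_v), (rows_h, cols_h)


if __name__ == "__main__":
    X = [
        [1, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 2, 0],
    ]
    print(basic_drift(X, 0, 0))
    print(basic_drift(X, 1, 1))
```

Started at the first entry, the sketch returns the regions written {1, 2, 3, 4} × {1} and {1, 2, 3} × {1, 4} in the 1-based notation of Example 1; started at the second diagonal entry it returns {2, 3, 4} × {1, 2} and {2, 3} × {1, 2, 4}.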

The ExtendedBasicDRIFT Algorithm. The BasicDRIFT algorithm is very
fast but it may miss some DRs. To remedy this deficiency, we introduce the
ExtendedBasicDRIFT algorithm, which works similarly to BasicDRIFT,
but after we obtain the set $R_v$ (respectively $C_h$) as in the BasicDRIFT algorithm
starting at (s, t), we perform a horizontal (respectively vertical) search over all
possible subsets of $R_v \setminus \{s\}$ (respectively $C_h \setminus \{t\}$) and return the maximal ones.

This algorithm returns all DRs containing (s, t). However, since it requires
more computations than BasicDRIFT does, we only invoke it to find
DRs that BasicDRIFT misses. The question now becomes how to combine
the two algorithms in an effective way, which is the purpose of our next algorithm.
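
A brute-force version of the exhaustive search can be sketched as follows. It is an illustration under stated assumptions rather than the authors' procedure: it enumerates only row subsets (every subset of the rows matching v in column t that contains s), closes each candidate to a maximal region, and therefore takes time exponential in the number of such rows. Indices are 0-based and the function name is hypothetical.

```python
from itertools import combinations


def extended_basic_drift(X, s, t):
    """Return all maximal constant regions containing (s, t).

    Each region is a pair (rows, cols) of 0-based index tuples.
    """
    n, p = len(X), len(X[0])
    v = X[s][t]
    # Rows other than s that also hold the value v in column t.
    others = [i for i in range(n) if i != s and X[i][t] == v]
    regions = set()
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            rows = (s,) + subset
            # Columns that equal v on every chosen row (always includes t).
            cols = tuple(j for j in range(p) if all(X[i][j] == v for i in rows))
            # Grow the row set to every row that equals v on those columns.
            closed = tuple(i for i in range(n) if all(X[i][j] == v for j in cols))
            regions.add((closed, cols))
    return regions
```

On the matrix of Example 1, starting from the entry in the second row and first column, the sketch recovers all four dense regions with value 1, since each of them contains that entry.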

The DRIFT Algorithm. We begin by introducing a key concept, called the isolated
point, which allows us to fully utilize the fast BasicDRIFT algorithm to
find as many regions as possible while minimizing the use of the more expensive
ExtendedBasicDRIFT algorithm.

Definition 2 (Isolated points). A point (i, j) in a dense region D is isolated
if it is not contained in any other dense region.

By Theorem 3 in [5], (s, t) is an isolated point iff the two DRs obtained from
BasicDRIFT are identical. Moreover, each isolated point belongs to only one DR;
hence, after we record this region, the removal of such a point does not delete
any legitimate DR but enhances the search for other DRs by BasicDRIFT.
After we remove all the isolated points recursively, ExtendedBasicDRIFT is
run on the reduced data matrix to find all remaining DRs.
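
The isolated-point test stated above (the two regions returned by BasicDRIFT coincide) can be sketched as below. The snippet repeats a compact 0-based transcription of BasicDRIFT so that it is self-contained; the helper names are hypothetical and the characterization itself is taken from the text, not proved here.

```python
def basic_drift(X, s, t):
    """Vertical-first and horizontal-first regions through (s, t), 0-based."""
    n, p = len(X), len(X[0])
    v = X[s][t]
    rows_v = [i for i in range(n) if X[i][t] == v]
    cols_v = [j for j in range(p) if all(X[i][j] == v for i in rows_v)]
    cols_h = [j for j in range(p) if X[s][j] == v]
    rows_h = [i for i in range(n) if all(X[i][j] == v for j in cols_h)]
    return (rows_v, cols_v), (rows_h, cols_h)


def is_isolated(X, s, t):
    """True if both BasicDRIFT searches from (s, t) return the same region."""
    vertical, horizontal = basic_drift(X, s, t)
    return vertical == horizontal


if __name__ == "__main__":
    X = [
        [1, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 2, 0],
    ]
    # The last entry of the first row lies in a single dense region only.
    print(is_isolated(X, 0, 3), is_isolated(X, 0, 0))
```

In Example 1 the last entry of the first row is isolated, because it belongs only to the region {1, 2, 3} × {1, 4}, whereas the first entry of the first row belongs to two regions and is not.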



Algorithm: DRIFT(X, v)

Repeat
  Start the BasicDRIFT at every point having value v
  Record all the regions found that are legitimate DRs
  Set the entries in X corresponding to the identified isolated points to be ∞
Until no further isolated point is found
Start the ExtendedBasicDRIFT at every point in the updated X having value v
Record all the regions found that are legitimate DRs



We remark that, a DR is “legitimate” if it is not a subset of any previously found 
DR. Moreover, one might want to discard DRs with small size. To do so, one 
may define a DR to be “illegitimate” if its size is below a user-specified threshold 
and thus it is not inserted into the output sequence. 



4 Extending the DRIFT for Data Streams 

In applications where data are generated in the form of continuous, rapid data
streams, such as clickstreams, an important consideration is how to update the
changing dense regions efficiently in such a stream environment. Our data
model is as follows. The entity-by-attribute table has its size fixed for all time
but its entries are changing. Moreover, at each instant, only one entry in the
table is allowed to change.

A naive way to obtain an updated set of DRs is to simply apply the DRIFT
algorithm to the most recent table (according to a prescribed time schedule). We
call this approach Snapshot Update (SSU). Obviously, such a method requires
a massive amount of computation, much of which is often redundant.

To design a better way of updating, we first observe that in data stream
environments, only newly formed DRs are interesting. Thus, we only consider
a change where X(s, t) is changed from a value not equal to v to v. Moreover, only the
DRs containing the entry (s, t) are updated and output. Such an approach greatly
reduces the update cost and makes it feasible to be point-triggered, i.e., to
find new DRs every time an entry is changed.



Algorithm: PointTriggerUpdate(s, t) /* input a changed point of X */

Run BasicDRIFT(X, s, t) to obtain two (possibly identical) new dense regions
If (s, t) is not an isolated point
  Run ExtendedBasicDRIFT(X, s, t) to obtain the remaining DRs
EndIf
Return all dense regions containing the changed entry (s, t)



5 Experimental Results 

We employ the web-log data from http://espnstar.com.cn to test the performance
of the DRIFT and the PointTriggerUpdate (PTU) algorithms. We use
two months’ web-log data to do the experiments. Table 1 lists the datasets for
our experiments, where ES1 and ES2 are the log datasets during December 2002.
Each data stream contains a set of continuous access sessions during some period
of time. Using the cube model proposed in [1], we can convert the original web-log
streams into access session data streams for dense region discovery.



Table 1. Characteristics of the data streams ES1 and ES2.

Data stream   # Accesses   # Sessions   # Visitors   # Pages
ES1               78,236        5,000          120       236
ES2            7,691,105      669,110       51,158     1,609




Fig. 1. Percentage of DRs updated by PTU and SSU during a 4-hour period.

Fig. 2. Running time (in sec.) of PTU vs. stream size.

Experiment 1. This experiment compares the cost of the two update
methods discussed in Section 4. We simulate data stream updates by using the
ES1 dataset, which contains continuous user accesses during a peak visiting
period of four hours. SSU is employed only at the beginning of each hour and at
the end of the 4th hour, while PTU is performed whenever a new tuple
arrives. What we compare is the percentage of DRs that need to be regenerated.
Obviously, SSU updates all the DRs no matter whether they have changed or
not. The experimental results suggest that PTU is more adaptive than SSU in
updating DRs in a data stream environment (see Fig. 1). On average, the update
cost of PTU is just around 16% of that of SSU, meaning that most of the
updates by SSU are redundant and wasted. Moreover, PTU can respond to
the peak period (the 18th time slot in Fig. 1) for updates, while SSU has to wait
until the 21st time slot.

Experiment 2. The next experiment employs the largest dataset, ES2, to test
the scalability of DRIFT with PTU in updating the continuously arriving
clickstream data. The experimental results show that the search times on dense
regions both without isolated points (C) and with isolated points (AfC) are
acceptable, even for several million clickstreams per hour; see Fig. 2.

6 Conclusion 

We demonstrate that dense regions are significant patterns that are useful in
knowledge discovery. Efficient and effective algorithms are presented for
finding and updating DRs. We refer readers to [5] for a theoretical treatment
of the subject. Our experiments validate that the DRIFT algorithm is very
useful in data stream applications such as online web usage mining. As future
work, we would like to develop further theoretical results characterizing
dense regions, from which more efficient algorithms may be derived, and to
explore the use of dense regions in other data mining applications.

Abstract. Integrating or linking data from different sources is an
increasingly important task in the preprocessing stage of many data
mining projects. The aim of such linkages is to merge all records relating
to the same entity, such as a patient or a customer. If no common unique
entity identifiers (keys) are available in all data sources, the linkage
needs to be performed using the available identifying attributes, like
names and addresses. Data confidentiality often limits or even prohibits
successful data linkage, as either no consent can be gained (for example
in biomedical studies) or the data holders are not willing to release their
data for linkage by other parties. We present methods for confidential
data linkage based on hash encoding, public key encryption and n-gram
similarity comparison techniques, and show how blind data linkage can
be performed.

Keywords: Privacy preserving data mining, hash encoding, data 
matching, public key infrastructure, n-gram indexing. 



1 Introduction and Related Work 

The ability to find matches between confidential data items held in two (or more) 
separate databases is an increasingly common requirement for many applications 
in data processing, analysis and mining. A medical research project, for exam- 
ple, may need to determine which individuals, whose identities are recorded in 
a population-based register of people suffering from hepatitis C infection, also 
appear in a separate population-based register of cases of liver cancer. Tradition- 
ally the linkage of records in these two databases would require that identified 
data on every registered individual be copied from one of the databases to the 
other, or from both databases to a third party (often the researchers or their 
proxy) [5]. This clearly invades the privacy of the individuals concerned. It is
typically infeasible to obtain consent for this invasion of privacy from the
individuals identified in each of the register databases; instead, one or more
ethics committees or institutional review boards must consent to the linkage of
the two databases on behalf of the individuals involved.

* Corresponding author

However, the invasion of privacy could be avoided, or at least mitigated, 
if there were some method of determining which records in the two databases 
matched, or were likely to match on more detailed comparison, without either 
database having to reveal any identifying information to each other or to a 
third party. Researchers can then use de-identified (or anonymised) versions of 
the linked records. If the use of anonymised data is not feasible, then at worst 
only a small subset of records from each of the databases need be given to the 
researchers, in which case it may be feasible to obtain direct consent from the 
individuals concerned. 

Anonymous data linkage based on SHA [8] hashed identifiers is used in
Switzerland [2] and France [7]. In the French system, spelling transformations
are performed before the identifiers are hashed (with an added pad to prevent 
dictionary attacks), and probabilistic linkage techniques are then applied based 
on exact matches between the transformed strings. 

2 Methods 

Alice holds a database, A, which contains one or more attributes (columns,
variables), denoted A.a, A.b and so on, containing confidential strings (like
names and addresses) or other character or symbol sequences. The need for
confidentiality may arise from the fact that the values in A.a identify
individuals, or because the information has some commercial value. Bob holds a
similar but quite separate database, B, also containing one or more
confidential columns, B.a, B.b and so on.

Alice and Bob wish to determine whether any of the values in A.a match any
of the values in B.a without revealing to each other or to any other party what
the actual values in A.a and B.a are. The problem is simple when "matching"
is defined as exact equality of the pair of strings or sequences being compared:
comparisons can be made between secure one-way message authentication digests
of the strings, created using algorithms such as the NIST HMAC [1], which in
turn uses a secure one-way hashing function such as SHA [8]. However, the
problem becomes more complicated if the strings contain errors (for example
typographical variations in names), because even a single character difference
between the strings will result in very different message digest values in which a
majority of bits will be different.
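
The exact-match case can be illustrated with a minimal sketch. It assumes HMAC with SHA-256 from the Python standard library as one possible instance of the keyed digest described above; the key value and the helper names `keyed_digest` and `differing_bits` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would result from a key agreement step.
SHARED_KEY = b"example-shared-secret"


def keyed_digest(value: str, key: bytes = SHARED_KEY) -> bytes:
    """Return the HMAC-SHA256 digest of a string value (SHA-256 is an assumption)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Identical strings give identical digests, so equality can be tested blindly.
exact = keyed_digest("peter") == keyed_digest("peter")

# A one-character variation gives an unrelated digest, so near matches are lost.
near = differing_bits(keyed_digest("peter"), keyed_digest("pete"))

print(exact)  # True
print(near)   # number of differing bits out of 256
```

The second value shows why digests alone cannot support approximate matching: the digests of two similar strings carry no usable notion of closeness.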

One method of overcoming this problem is to reduce the dimensionality of 
the secret strings in A.a and B.a before they are converted into a secure digest,
using, for example, a phonetic encoding function such as Soundex [6]. However, 
Soundex and other phonetic transformations are not perfect - in particular they 
are not robust to errors in the initial character, and to truncation differences. 

Ideally, a protocol is required which permits the blind calculation, by a trusted
third party (Carol), of a more general and robust measure of similarity between
the pairs of secret strings. 



Blind Data Linkage Using n-gram Similarity Comparisons 






2.1 n-gram Similarity Comparison of Secret Strings or Sequences 

The protocol assumes that Carol is trusted by Alice and Bob to (a) adhere to the 
protocol, (b) not reveal information to other parties except where permitted by 
the protocol, and (c) not try to determine the values of Alice’s or Bob’s source 
strings using cryptologic techniques. There is no assumption that Alice trusts 
Bob or vice versa. Note that Alice and Bob do need to share meta-data about 
the nature of the information contained in their databases with each other - in 
order to decide which columns/attributes can be validly compared - but they 
do not need to share the actual values of those columns, nor summary measures 
(such as frequency counts) derived from those values. The protocol is as follows. 



1. Alice and Bob mutually agree on a secret random key, K_AB, which they
share only with each other. This can be done using the Diffie-Hellman key
agreement protocol [3]. They also agree on a secure one-way message
authentication algorithm, k_hmac, such as the NIST HMAC [1], which in turn
uses a one-way hashing function k_owh (e.g. MD5 or SHA [8]). The shared
key K_AB is used to protect against "dictionary" attacks. Alice and Bob also
need to agree on a protocol for preprocessing strings to render them into a
standard form (such as converting all characters to lower case, removal or
substitution of punctuation and whitespace, and so on).

2. Alice computes a sorted list of bigrams^1 for each preprocessed (as described
in step 1 above) value in the column A.a. For example, if a value of A.a is
"Peter" then the sorted list of bigrams is ("er","et","pe","te"). Note that
duplicate bigrams are removed, so each bigram is unique in each list. Alice
next calculates all possible sub-lists of all lengths greater than zero for each
bigram list, in other words, the power-set of bigrams minus the empty set.
For the example given above, Alice computes bigram sub-lists ranging from
length 1 to 4:

("er"), ("et"), ("pe"), ("te"),
("er","et"), ("er","pe"), ("er","te"), ("et","pe"), ("et","te"), ("pe","te"),
("er","et","pe"), ("er","et","te"), ("er","pe","te"), ("et","pe","te"),
("er","et","pe","te").

If a bigram list contains b bigrams, the resulting number of sub-lists is
2^b - 1. Alice then transforms each of the calculated bigram sub-lists into a
secure hash digest using K_AB as the key. These hashes are stored in column
A.a_hash_bigr_comb. Alice also creates an encrypted version of the record
identifier (key) for the string from which each value in A.a_hash_bigr_comb
was derived, and she stores this in A.encrypt_rec_key. She also places
the length (that is, the number of bigrams) of each A.a_hash_bigr_comb in
a column called A.a_hash_bigr_comb_len, and the length (that is, the
number of bigrams) of each original secret string in A.a, in a column
A.a_len. Alice then sends the set of quadruplets (A.a_hash_bigr_comb,
A.a_hash_bigr_comb_len, A.encrypt_rec_key, A.a_len) to Carol. Note
that the number of quadruplets is much larger than the number of original
records in A.a.

^1 In this example, bigrams (2-grams, n = 2) are used, but the extension to
trigrams (n = 3) and other n-grams is direct.

3. Bob carries out the same procedure as in step 2 with his column B.a, and
also sends the resulting set of quadruplets to Carol.

4. Carol determines the set intersection of the values of A.a_hash_bigr_comb
and B.a_hash_bigr_comb which she has been sent by Alice and Bob
respectively. For each value of a_hash_bigr_comb shared by A and B, for
each unique pairing of (A.encrypt_rec_key, B.encrypt_rec_key), Carol
calculates a bigram score

$$\text{bigr\_score} = \frac{2 \cdot \text{A.a\_hash\_bigr\_comb\_len}}{\text{A.a\_len} + \text{B.a\_len}}$$

and selects the maximum bigram score value for each possible unique pairing
of (A.encrypt_rec_key, B.encrypt_rec_key), that is, the highest score
for each pair of strings from A.a and B.a. Note that a bigram score of 1.0
corresponds to an exact match between two values.
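
Steps 2 to 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used by the authors: HMAC-SHA256 stands in for the agreed keyed digest, lower-casing is the only preprocessing, the record keys are assumed to be already encrypted, the bigram count of a string is taken after duplicates are removed, and all function names are hypothetical.

```python
import hashlib
import hmac
from itertools import combinations


def bigrams(value: str) -> tuple:
    """Sorted tuple of the unique bigrams of a lower-cased string."""
    s = value.lower()
    return tuple(sorted({s[i:i + 2] for i in range(len(s) - 1)}))


def sublists(grams: tuple):
    """All non-empty sub-lists of a sorted bigram tuple (power set minus empty set)."""
    for n in range(1, len(grams) + 1):
        yield from combinations(grams, n)


def quadruplets(records: dict, key: bytes) -> list:
    """Steps 2 and 3: (hash, sub-list length, record key, bigram count) per sub-list.

    `records` maps an (assumed already encrypted) record key to a secret string.
    """
    out = []
    for rec_key, value in records.items():
        grams = bigrams(value)
        for sub in sublists(grams):
            message = "|".join(sub).encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).hexdigest()
            out.append((digest, len(sub), rec_key, len(grams)))
    return out


def carol_scores(quads_a: list, quads_b: list) -> dict:
    """Step 4: maximum bigram score per (A record key, B record key) pair."""
    index_b = {}
    for digest, _, rec_key, total in quads_b:
        index_b.setdefault(digest, []).append((rec_key, total))
    best = {}
    for digest, sub_len, key_a, total_a in quads_a:
        for key_b, total_b in index_b.get(digest, []):
            score = 2 * sub_len / (total_a + total_b)
            pair = (key_a, key_b)
            best[pair] = max(best.get(pair, 0.0), score)
    return best


shared_key = b"hypothetical-shared-key"   # stands in for K_AB
alice = quadruplets({"a1": "Peter"}, shared_key)
bob = quadruplets({"b1": "Pete", "b2": "peter"}, shared_key)
scores = carol_scores(alice, bob)

print(len(alice))             # 15 sub-lists for the 4 bigrams of "peter"
print(scores[("a1", "b2")])   # 1.0 for the exact match
```

In this sketch "peter" and "pete" share the sub-list ("et","pe","te"), so their score is 2 * 3 / (4 + 3) = 6/7. Carol only ever handles digests, lengths and record keys, never the strings themselves.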



What happens next depends on the context. Carol may report the number 
of strings with a bigram score above an agreed threshold to Alice and Bob, 
who may then negotiate further steps, or Carol may simply report the similarity 
scores and the encrypted record keys back to Alice and Bob. Alternatively, Carol 
may send this information to another third party, David, who oversees an over- 
arching blind data linkage protocol involving a number of different columns from 
Alice’s and Bob’s databases (that is, not just A.a and B.a, but also A.b and 
B.b, A.c and B.c and so on). 

Another strategy which would further reduce the risk of misuse of the infor- 
mation sent to Carol would be to have many Carols available, all functionally 
equivalent, and for Alice and Bob to decide on which of these Carols to use only 
at the very last moment. This would mean that a potential attacker would need 
to suborn or compromise a large number of the Carols in order to have a rea- 
sonable chance of gaining access to the information provided to one particular 
Carol by Alice and Bob. 



2.2 Blind Data Linkage 

The essence of modern data (or record) linkage techniques [4,9] is the indepen- 
dent comparison of a number of partial identifiers (attributes) between pairs of 
records, and the combination of the results of these comparisons into a com- 
pound or summary score (called matching weight) which is then judged against 
some criteria to classify that pair of records as a match (link), a non-match, or 
as undecided (potential match). Usually the result of the comparison between 
individual attributes is weighted in some way - in deterministic systems these 
weights are determined heuristically, whereas in probabilistic systems the weights 
are determined statistically based on the relative reliability of that attribute in 
deciding matches and non-matches [4], and on the relative frequency of the values
of that attribute [9]. The classification criteria for the summary scores (match- 
ing weights) are often determined heuristically or statistically, using expectation 
maximisation (EM) techniques [10]. 

Thus, the first task is to compare each of the partially-identifying attributes 
and return a similarity score for each pair. We have demonstrated how this can 
be done blindly in the previous section. 

For each of the partially-identifying (or partially-discriminating) attributes,
a, b, ..., i, in their databases A and B, Alice and Bob dispatch the similarity
comparison task to different instances of Carol, which we will term Carol_a,
Carol_b, ..., Carol_i and so on. Each of these tasks is independent of the
others, and should use a different shared secret key K_AB. Each instance of
Carol sends the results back to another third party (we will call this party
David) which oversees the entire data linkage task between A and B. Thus, David
accumulates a series of data sets containing comparison values (or similarity
scores) from the Carols.

David joins these data sets and forms a sparse matrix where each entry 
contains all the comparison values for a record pair. It is now a simple matter 
for David to multiply this matrix by a vector of weights (for each attribute), 
and then sum across each row to create a summary matching weight, which is 
compared to some criteria (thresholds) which have been determined through 
EM [10]. By these methods it is possible for David to arrive at a set of blindly 
linked records - that is, pairs of (A.encrypt_rec_key, B.encrypt_rec_key). 
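
David's combination step can be sketched as below. The attribute weights, the two thresholds and the function names are hypothetical values chosen for illustration only; in the approach described above they would be determined heuristically or through EM.

```python
def summary_weights(scores_by_attribute: dict, weights: dict) -> dict:
    """Sum the weighted similarity scores of every record pair.

    `scores_by_attribute` maps an attribute name to a dict of
    {(A record key, B record key): similarity score}. A pair that is absent
    for an attribute contributes 0 for it (the matrix is sparse).
    """
    totals = {}
    for attribute, pair_scores in scores_by_attribute.items():
        for pair, score in pair_scores.items():
            totals[pair] = totals.get(pair, 0.0) + weights[attribute] * score
    return totals


def classify(total: float, lower: float, upper: float) -> str:
    """Classify a summary matching weight against two thresholds."""
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"


# Hypothetical results returned by two Carol instances.
scores_by_attribute = {
    "a": {("a1", "b1"): 1.0, ("a2", "b2"): 0.5},
    "b": {("a1", "b1"): 0.8},
}
weights = {"a": 2.0, "b": 1.0}        # assumed attribute weights
totals = summary_weights(scores_by_attribute, weights)
links = {pair: classify(t, lower=1.5, upper=2.5) for pair, t in totals.items()}
print(links)
```

David works only with encrypted record keys and similarity scores, so the classification is carried out without access to any identifying value.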

3 Conclusions and Future Work 

In this paper we have presented some methods for blind "fuzzy" linkage of records
using hash encoding, public key encryption and n-gram similarity comparison 
techniques. Proof-of-concept implementations have demonstrated the feasibility 
of our approach, albeit at the expense of very high data transmission overheads. 

References 

1. Bellare, M., Canetti, R. and Krawczyk, H.: Message authentication using hash
functions - the HMAC construction. RSA Laboratories, CryptoBytes 1996, 2:15.

2. Borst, F., Allaert, F.A. and Quantin, C.: The Swiss Solution for Anonymous
Chaining Patient Files. MEDINFO 2001.

3. Diffie, W. and Hellman, M.E.: New directions in cryptography. IEEE Trans.
Inform. Theory IT-22 1976, 6:644-654.

4. Fellegi, I. and Sunter, A.: A Theory for Record Linkage. Journal of the
American Statistical Association, 1969.

5. Kelman, C.W., Bass, A.J. and Holman, C.D.J.: Research use of linked health
data - A best practice protocol. ANZ Journal of Public Health, 26:3, 2002.

6. Lait, A.J. and Randell, B.: An Assessment of Name Matching Algorithms.
Technical Report, Dept. of Computing Science, University of Newcastle upon
Tyne, UK, 1993.






T. Churches and P. Christen 



7. Quantin, C., Bouzelat, H., Allaert, F.A.A., Benhamiche, A.M., Faivre, J. and
Dusserre, L.: How to ensure data quality of an epidemiological follow-up:
Quality assessment of an anonymous record linkage procedure. Intl. Journal of
Medical Informatics, vol. 49, pp. 117-122, 1998.

8. Schneier, B.: Applied Cryptography. John Wiley & Sons, second edition, 1996.

9. Winkler, W.E.: The State of Record Linkage and Current Research Problems.
RR99/03, US Bureau of the Census, 1999.

10. Winkler, W.E.: Using the EM algorithm for weight computation in the
Fellegi-Sunter model of record linkage. RR00/05, US Bureau of the Census, 2000.



Condensed Representation of Emerging Patterns 



Arnaud Soulet, Bruno Crémilleux, and François Rioult

GREYC, CNRS - UMR 6072, Université de Caen
Campus Côte de Nacre
F-14032 Caen Cedex France
{Forename.Surname}@info.unicaen.fr



Abstract. Emerging patterns (EPs) are associations of features whose
frequencies increase significantly from one class to another. They have
proven useful for building powerful classifiers and for helping to establish
diagnoses. Because of the huge search space, mining and representing EPs
is a hard task for large datasets. Using recent results on condensed
representations of frequent closed patterns, we propose here an exact
condensed representation of EPs. We also give a method to provide the EPs
with the highest growth rates, which we call strong emerging patterns (SEPs).
Experiments carried out in collaboration with the Philips company show the
value of SEPs.



1 Introduction 

The characterization of classes and classification are significant fields of
research in data mining and machine learning. Initially introduced in [5],
emerging patterns (EPs) are patterns whose frequency strongly varies between
two data sets (i.e., two classes). EPs characterize the classes in a
quantitative and qualitative way. Thanks to their capacity to emphasize the
distinctions between classes, EPs make it possible to build classifiers [6] or
to assist in diagnosis. Nevertheless, mining EPs in large datasets remains a
challenge because of the very high number of candidate patterns.

In this paper, we propose two contributions for the efficient extraction of
emerging patterns. One originality of our approach is to take advantage of
recent progress on the condensed representations of closed patterns [9].
Firstly, we propose an exact condensed representation of the emerging patterns
for a dataset. Contrary to the borders approach (Section 2), which provides the
EPs with a lower bound of their growth rate, this condensed representation
makes it easy to know the exact growth rate of each emerging pattern. Secondly,
we propose a method to easily provide the emerging patterns having the best
growth rates (we call them "strong emerging patterns"). This work is also
justified by requests from providers of data. In our collaboration with the
Philips company, we observe in practice a high number of EPs, and the strong
emerging patterns are particularly useful for producing more concise and more
exploitable results.

The paper is organized as follows. Section 2 introduces the context and the
required notations. Section 3 provides an exact condensed representation of
the emerging patterns and proposes the strong emerging patterns. Finally,
Section 4 presents the experimental evaluations.



2 Context and Related Works 



Notations and definitions. Let $\mathcal{D}$ be a dataset (Table 1), which is an
excerpt of the data used for the search for failures in a production chain
(cf. Section 4).



Table 1. Example of a transactional data set

        D
Batch   Items
B1      C1  A B C D
B2      C1  A B C
B3      C1  A D E
B4      C2  A B C
B5      C2  B C D E
B6      C2  B E



Each line (or transaction) of Table 1 represents a batch (noted $B_1, \ldots, B_6$)
described by features (or items): $A, \ldots, E$ denote the advance of the batch
within the production chain and $C_1$, $C_2$ the class values. $\mathcal{D}$ is
partitioned here into two datasets $\mathcal{D}_1$ (the right batches) and
$\mathcal{D}_2$ (the defective batches). The transactions having item $C_1$
(resp. $C_2$) belong to $\mathcal{D}_1$ (resp. $\mathcal{D}_2$). A pattern is a
set of items (e.g., $\{A, B, C\}$) noted by the string $ABC$. A transaction $t$
contains the pattern $X$ if and only if $X \subseteq t$.

The concept of emerging patterns is related to the notion of frequency. The
frequency of a pattern $X$ in a dataset $\mathcal{D}$ (noted
$\mathcal{F}(X, \mathcal{D})$) is the number of transactions of $\mathcal{D}$
which contain $X$. $X$ is frequent if its frequency is at least the frequency
threshold fixed by the user. Intuitively, an emerging pattern is a pattern
whose frequency increases significantly from one class to another. The
contrast between classes captured by a pattern is measured by its growth rate.
The growth rate of a pattern $X$ from $\mathcal{D} \setminus \mathcal{D}_i$ to
$\mathcal{D}_i$ is defined as:



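$$GR_i(X) = \frac{|\mathcal{D}| - |\mathcal{D}_i|}{|\mathcal{D}_i|} \times \frac{\mathcal{F}(X, \mathcal{D}_i)}{\mathcal{F}(X, \mathcal{D}) - \mathcal{F}(X, \mathcal{D}_i)} \qquad (1)$$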



Definition 1 (emerging pattern or EP). Given threshold $\rho > 1$, a pattern
$X$ is said to be an emerging pattern from $\mathcal{D} \setminus \mathcal{D}_i$
to $\mathcal{D}_i$ if $GR_i(X) \geq \rho$.

Let us give some examples from Table 1. With $\rho = 3$, $A$ and $ABCD$ are EPs
from $\mathcal{D}_2$ to $\mathcal{D}_1$. Indeed, $GR_1(A) = 3$ and
$GR_1(ABCD) = \infty$.
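
The two growth rates quoted above can be checked with a short sketch on the data of Table 1. The function names are hypothetical, and the value returned when a pattern occurs in neither class (0.0) is a convention assumed here, since that case is not discussed in the text.

```python
import math

# Table 1: each batch is a set of items; "C1" and "C2" are the class items.
DATASET = [
    {"C1", "A", "B", "C", "D"},   # B1
    {"C1", "A", "B", "C"},        # B2
    {"C1", "A", "D", "E"},        # B3
    {"C2", "A", "B", "C"},        # B4
    {"C2", "B", "C", "D", "E"},   # B5
    {"C2", "B", "E"},             # B6
]


def frequency(pattern: set, transactions: list) -> int:
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def growth_rate(pattern: set, class_item: str) -> float:
    """Growth rate of Equation 1 towards the class identified by `class_item`."""
    in_class = [t for t in DATASET if class_item in t]
    f_class = frequency(pattern, in_class)
    f_rest = frequency(pattern, DATASET) - f_class
    if f_rest == 0:
        # Assumed convention: infinite if the pattern occurs only in the class,
        # 0.0 if it occurs nowhere.
        return math.inf if f_class > 0 else 0.0
    ratio = (len(DATASET) - len(in_class)) / len(in_class)
    return ratio * f_class / f_rest


print(growth_rate(set("A"), "C1"))      # 3.0
print(growth_rate(set("ABCD"), "C1"))   # inf
```

Pattern $A$ occurs in three batches of $\mathcal{D}_1$ and one batch of $\mathcal{D}_2$, and both classes have three batches, which gives the growth rate 3.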

Related Works. Efficient computation of all EPs in high-dimensional datasets
remains a challenge because the number of candidate patterns is exponential
in the number of items. The naive enumeration of all patterns with their
frequencies fails quickly. In addition, the definition of EPs does not provide
anti-monotone constraints that would allow powerful pruning of the search
space for methods stemming from the framework of level-wise algorithms [8].
The approach of handling borders, introduced by Dong et al. [5], gives a
concise description of the emerging patterns. On the other hand, it requires
repeating the computation of the intervals for all the $\mathcal{D}_i$, and it
does not provide the growth rate of each EP. Some other approaches exist, such
as [10,7,1].

In the following, we focus on the condensed representation based on closed
patterns [4]. Such an approach enables the implementation of powerful pruning
criteria during the extraction, which improves the efficiency of algorithms
[9,3]. A closed pattern in $\mathcal{D}$ is a maximal set of items shared by a
set of transactions [2]. The notion of closure is linked to that of closed
pattern. The closure of a pattern $X$ in $\mathcal{D}$ is
$h(X, \mathcal{D}) = \bigcap \{\text{transaction } t \text{ in } \mathcal{D} \mid X \subseteq t\}$.
An important property of the frequency stems from this definition. The closure
of $X$ is a closed pattern and
$\mathcal{F}(X, \mathcal{D}) = \mathcal{F}(h(X, \mathcal{D}), \mathcal{D})$.
In our example, $h(AB, \mathcal{D}) = ABC$ and
$\mathcal{F}(AB, \mathcal{D}) = \mathcal{F}(ABC, \mathcal{D})$. Thus, the set
of closed patterns is a condensed representation of all patterns because the
frequency of any pattern can be inferred from its closure.
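
The closure operator and the frequency property can be illustrated directly on Table 1. This is a naive sketch that intersects the supporting transactions; it is not one of the efficient closed-pattern miners cited above, and the function names are hypothetical.

```python
# Table 1 as a list of transactions; "C1" and "C2" are the class items.
DATASET = [
    frozenset({"C1", "A", "B", "C", "D"}),   # B1
    frozenset({"C1", "A", "B", "C"}),        # B2
    frozenset({"C1", "A", "D", "E"}),        # B3
    frozenset({"C2", "A", "B", "C"}),        # B4
    frozenset({"C2", "B", "C", "D", "E"}),   # B5
    frozenset({"C2", "B", "E"}),             # B6
]


def frequency(pattern: frozenset, transactions: list) -> int:
    """Number of transactions that contain the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def closure(pattern: frozenset, transactions: list) -> frozenset:
    """Intersection of all transactions that contain the pattern."""
    supporting = [t for t in transactions if pattern <= t]
    if not supporting:
        raise ValueError("the pattern occurs in no transaction")
    return frozenset.intersection(*supporting)


ab = frozenset("AB")
h_ab = closure(ab, DATASET)

print(sorted(h_ab))                                       # ['A', 'B', 'C']
print(frequency(ab, DATASET), frequency(h_ab, DATASET))   # 3 3
```

$AB$ is contained in batches $B_1$, $B_2$ and $B_4$, whose common items are $A$, $B$ and $C$, so $AB$ and its closure $ABC$ have the same frequency.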



3 Condensed Representation and Strong Emerging 
Patterns 

3.1 Exact Condensed Representation of Emerging Patterns 

Let us now see how to get the growth rate of any pattern $X$. Equation 1
shows that it is enough to compute $\mathcal{F}(X, \mathcal{D})$ and
$\mathcal{F}(X, \mathcal{D}_i)$. These frequencies can be obtained from the
condensed representation of frequent closed patterns. Indeed,
$\mathcal{F}(X, \mathcal{D}) = \mathcal{F}(h(X, \mathcal{D}), \mathcal{D})$
(closure property) and, by the definition of the partial bases $\mathcal{D}_i$,
$\mathcal{F}(X, \mathcal{D}_i) = \mathcal{F}(XC_i, \mathcal{D}) = \mathcal{F}(h(XC_i, \mathcal{D}), \mathcal{D})$.
Unfortunately, these relations require the computation of two closures
($h(X, \mathcal{D})$ and $h(XC_i, \mathcal{D})$), which is not efficient. The
following properties overcome this drawback:

Property 1. Let $X$ be a pattern and $\mathcal{D}_i$ a dataset. Then
$\mathcal{F}(X, \mathcal{D}_i) = \mathcal{F}(h(X, \mathcal{D}), \mathcal{D}_i)$.

Proof. The properties of the closure operator ensure that for any transaction
$t$, $X \subseteq t \Leftrightarrow h(X, \mathcal{D}) \subseteq t$. In
particular, the transactions of $\mathcal{D}_i$ containing $X$ are identical to
those containing $h(X, \mathcal{D})$, and we have the equality of the
frequencies.

It is now simple to show that the growth rate of every pattern $X$ is obtained
from the sole knowledge of the growth rate of $h(X, \mathcal{D})$:

Property 2. Let $X$ be a pattern. We have $GR_i(X) = GR_i(h(X, \mathcal{D}))$.

Proof. Let $X$ be a pattern. By replacing $\mathcal{F}(X, \mathcal{D})$ with
$\mathcal{F}(h(X, \mathcal{D}), \mathcal{D})$ and $\mathcal{F}(X, \mathcal{D}_i)$
with $\mathcal{F}(h(X, \mathcal{D}), \mathcal{D}_i)$ in Equation 1, we
immediately recognize the growth rate of $h(X, \mathcal{D})$.

For instance, $h(AB, \mathcal{D}) = ABC$ and $GR_1(AB) = GR_1(ABC) = 2$. This
property is significant because the number of closed patterns is lower (and, in
general, much lower) than that of all patterns [4]. Thus, the frequent closed
patterns with their growth rates are enough to synthesize the whole set of
frequent EPs with their growth rates. We obtain an exact condensed
representation of the EPs (i.e., the growth rate of each emerging pattern is
exactly known). Let us recall that the borders technique (cf. Section 2) only
gives a lower bound of the growth rate.



3.2 Strong Emerging Patterns 

The number of emerging patterns of a dataset can be crippling for their use.
In practice, it is judicious to keep only the most frequent EPs having the best
growth rates. But thoughtlessly raising these two thresholds may be problematic.
On the one hand, if the minimal growth rate threshold is too high, the EPs
found tend to be too specific (i.e., too long). On the other hand, if the
minimal frequency threshold is too high, the EPs have too low a growth rate.

We define here the strong emerging patterns, which are the patterns having
the best possible growth rates. They are a trade-off between the frequency and
the growth rate.

Definition 2 (strong emerging pattern). A strong emerging pattern $X$
(SEP for short) for $\mathcal{D}_i$ is the emerging pattern coming from a
closed pattern $XC_i$ in $\mathcal{D}_i$ (i.e., $C_i$ does not belong to the
SEP).

SEPs highlight the following important property (due to space limitations, the
proof is not given here).

Property 3 (SEPs: EPs with maximum growth rate). Let $X$ be a pattern not
containing the item $C_i$. Then the SEP coming from $h(X, \mathcal{D}_i)$ has a
growth rate at least as good as that of $X$, i.e., one has
$GR_i(X) \leq GR_i(h(X, \mathcal{D}_i) \setminus \{C_i\})$.

Let us illustrate Property 3 on the elementary example. The pattern $BC$ is
not a SEP for class 1 (because $h(BC, \mathcal{D}_1) \setminus \{C_1\} = ABC$);
its growth rate is 1, and we indeed have $GR_1(BC) \leq GR_1(ABC) = 2$ and
$\mathcal{F}(BC, \mathcal{D}_1) = \mathcal{F}(ABC, \mathcal{D}_1)$.
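
The illustration above can be reproduced, and Property 3 checked exhaustively on Table 1, with the following sketch. It computes the closure naively inside the class, which is an assumption made for brevity rather than the extraction method of the paper; the function names and the convention for patterns occurring in neither class are hypothetical.

```python
import math
from itertools import combinations

# Table 1 as a list of transactions; "C1" and "C2" are the class items.
D = [
    frozenset({"C1", "A", "B", "C", "D"}),   # B1
    frozenset({"C1", "A", "B", "C"}),        # B2
    frozenset({"C1", "A", "D", "E"}),        # B3
    frozenset({"C2", "A", "B", "C"}),        # B4
    frozenset({"C2", "B", "C", "D", "E"}),   # B5
    frozenset({"C2", "B", "E"}),             # B6
]


def frequency(pattern, transactions):
    return sum(1 for t in transactions if pattern <= t)


def growth_rate(pattern, class_item):
    """Growth rate of Equation 1 (0.0 assumed if the pattern occurs nowhere)."""
    in_class = [t for t in D if class_item in t]
    f_class = frequency(pattern, in_class)
    f_rest = frequency(pattern, D) - f_class
    if f_rest == 0:
        return math.inf if f_class > 0 else 0.0
    return (len(D) - len(in_class)) / len(in_class) * f_class / f_rest


def strong_ep(pattern, class_item):
    """Closure of the pattern within the class, with the class item removed."""
    supporting = [t for t in D if class_item in t and pattern <= t]
    if not supporting:
        raise ValueError("the pattern does not occur in this class")
    return frozenset.intersection(*supporting) - {class_item}


bc = frozenset("BC")
sep = strong_ep(bc, "C1")
print(sorted(sep), growth_rate(bc, "C1"), growth_rate(sep, "C1"))

# Property 3 on every pattern over A..E that occurs in class 1.
class_1 = [t for t in D if "C1" in t]
violations = [
    x
    for n in range(1, 6)
    for x in map(frozenset, combinations("ABCDE", n))
    if frequency(x, class_1) > 0
    and growth_rate(x, "C1") > growth_rate(strong_ep(x, "C1"), "C1")
]
print(violations)   # []
```

The empty list reflects the argument behind the property: the SEP has the same frequency as $X$ in the class but, being a superset of $X$, can only occur less often outside it.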

Compared to EPs, SEPs have two meaningful advantages: they are easy
to discover from the condensed representation of frequent closed patterns (by
simply filtering those containing $C_i$), and they have the best possible
growth rates. Let us notice that the emerging patterns based on $X$ and
$h(X, \mathcal{D}_i)$ have the same frequency, thus they have the same quality
according to this criterion. However, the strong emerging pattern coming from
$h(X, \mathcal{D}_i)$ has a stronger (i.e., higher) growth rate and thus offers
a better compromise between frequency and growth rate.

4 Experiments

Experiments were carried out on a real dataset within a collaboration with the
Philips company. The industrial aim is to identify faulty tools in a silicon
plate production chain. The quality test leads to three quasi-homogeneous
classes corresponding to three quality levels. Figure 1 depicts the
distributions of EPs according to the length of patterns for a minimal
frequency threshold of 1.2%. The number of closed EPs (which stem from closed
patterns) measures the size of the condensed representation. Two threshold
values of the minimal growth rate (1 and $\infty$) are used. This figure shows
that the number of EPs is very high compared to the number of closed EPs or
SEPs. This disproportion does not decrease when the minimal growth rate is
raised. Such large numbers of EPs cannot be presented to an expert for
analysis. Other experiments have shown that the number of closed EPs and SEPs
increases less quickly than the number of EPs when the frequency threshold
decreases. So, it is possible to examine longer and less frequent patterns.

Fig. 1. Comparison between the different kinds of emerging patterns



After discussions, the Philips experts confirmed that the stage suspected by
the SEPs was the real cause of the failures (a piece of equipment was badly
tuned). This experiment shows the practical contribution of SEPs on real-world
data.



5 Conclusion 

Based on recent results in condensed representations, we have revisited the
field of emerging patterns. We have defined an exact condensed representation
of emerging patterns for a database and proposed the strong emerging patterns,
which are the EPs with the highest growth rates. In addition to the simplicity
of their extraction, this approach produces only a few SEPs, which are
particularly useful for assisting diagnosis. So, it is easier to use SEPs than
to search for relevant EPs among a large number of EPs. In our collaboration
with the Philips company, SEPs made it possible to successfully identify the
failures of a production chain of silicon plates. These promising results
encourage the use of strong emerging patterns. Further work concerns the use of
SEPs for classification tasks.

Abstract. In order to extract meaningful and hidden knowledge from
semistructured documents such as HTML or XML files, methods for
discovering frequent patterns or common characteristics in semistructured
documents have become increasingly important. We propose new
methods for discovering maximally frequent tree structured patterns in
semistructured Web documents by using tag tree patterns as hypotheses.
A tag tree pattern is an edge labeled tree which has ordered or unordered
children and structured variables. An edge label is a tag or a keyword
in such Web documents, and a variable can match an arbitrary subtree,
which represents a field of a semistructured document. As a special case,
a contractible variable can match an empty subtree, which represents a
missing field in a semistructured document. Since semistructured documents
have irregularities such as missing fields, a tag tree pattern with
contractible variables is suited for representing tree structured patterns
in such semistructured documents. First, we present an algorithm for
generating all maximally frequent ordered tag tree patterns with
contractible variables. Second, we give an algorithm for generating all
maximally frequent unordered tag tree patterns with contractible variables.



1 Introduction 

Data model: We use rooted trees as representations of semistructured data 
such as HTML or XML files, according to Object Exchange Model [1]. In this 
paper, “ordered” means “with ordered children” and “unordered” means “with 
unordered children”. We consider both ordered trees and unordered trees in 
order to deal with various semistructured data. 

Our approach: To formulate a schema on such tree structured data we have 
proposed a tag tree pattern [5,6,7] (Sec. 2). A tag tree pattern is an edge la- 
beled tree which has ordered or unordered children and structured variables. 
An edge label is a tag or a keyword in Web documents, or a wildcard for any 
string. A variable can match an arbitrary subtree, which represents a field of a 
semistructured document. As a special case, a contractible variable can match
an empty subtree, which represents a missing field in a semistructured
document. Since semistructured data have irregularities such as missing fields,
a tag tree pattern with contractible variables is suited for representing tree
structured patterns in such semistructured documents. In the examples of tree
structured data and tag tree patterns (Fig. 1), the variable with label "x" of
the tag tree pattern t_2 matches the subtree g_1 and the contractible variable
with label "y" of t_2 matches the empty tree g_2.

Novelty of our approach: Graph or tree-based data mining and discovery 
of frequent structures in graph or tree structured data have been extensively 
studied [2,3,11,12]. Our target of discovery is neither a simply frequent pattern 
nor a maximally frequent pattern with respect to syntactic sizes of patterns such 
as the number of vertices. In order to apply our method to information extraction 
from heterogeneous semistructured Web documents, our target of discovery is a 
semantically and maximally frequent tag tree pattern (Sec. 3) which represents a 
common characteristic in semistructured documents. “Semantically” means that 
the maximality is described by the descriptive power (or language in Sec. 2) of 
tree structured patterns. 

Data mining problems: In this paper, we consider the following data min- 
ing problems. MFOTTP (resp. MFUTTP) (Sec. 3) is a problem to gener- 
ate all Maximally Frequent Ordered (resp. Unordered) Tag Tree Patterns with 
frequency above a user-specified threshold from a given set of ordered (resp. 
unordered) semistructured data. Consider the examples in Fig. 1. For a set 
of semistructured data $\{T_1, T_2, T_3\}$, the tag tree pattern $t_2$ is a maximally 
$\frac{2}{3}$-frequent ordered tag tree pattern. In fact, $t_2$ explains $T_2$ and $T_3$, but $t_2$ does 
not explain $T_1$. The tag tree pattern $t_1$ also explains $T_2$ and $T_3$. But $t_1$ explains 
any tree with two or more vertices, so $t_1$ is overgeneralized and meaningless. So 
semantic maximality of desired tag tree patterns is important. 

Main results: In Sec. 4, we present an algorithm GEN-MFOTTP for generat- 
ing all maximally frequent ordered tag tree patterns with contractible variables. 
In Sec. 5, we give an algorithm GEN-MFUTTP for generating all maximally 
frequent unordered tag tree patterns with contractible variables. 

Related works: A tag tree pattern is different from other representations 
of tree structured patterns such as in [2,3,11] in that a tag tree pattern has 
structured variables which can match arbitrary trees and a tag tree pattern 
represents not a substructure but a whole tree structure. As for our previous 
works, we proposed a method for generating all maximally frequent unordered 
(resp. ordered) tag tree patterns without contractible variables [5] (resp. [7]) 
by using an algorithm for generating unordered (resp. ordered) trees [4] (resp. 
[9]) and our algorithm for maximality test. Also we gave a polynomial time 
algorithm for finding one of the least generalized ordered tag tree patterns with 
contractible variables [6]. Our algorithms in this paper use polynomial time 
matching algorithms for ordered and unordered term trees with contractible 
variables [10] to compute the frequency of tag tree patterns. 

Fig. 1. Tag tree patterns $t_1$, $t_2$, and trees $T_1$, $T_2$, $T_3$, $g_1$, $g_2$. An uncontractible (resp. 
contractible) variable is represented by a single (resp. double) lined box with lines to 
its elements. The label inside a box is the variable label of the variable. 



2 Preliminaries 

Definition 1. (Ordered term trees and unordered term trees) Let $T = (V_T, E_T)$ 
be a rooted tree with ordered children or unordered children, which 
has a set $V_T$ of vertices and a set $E_T$ of edges. We call a rooted tree with ordered 
(resp. unordered) children an ordered tree (resp. an unordered tree). Let 
$E_g$ and $H_g$ be a partition of $E_T$, i.e., $E_g \cup H_g = E_T$ and $E_g \cap H_g = \emptyset$. And let 
$V_g = V_T$. A triplet $g = (V_g, E_g, H_g)$ is called an ordered term tree if $T$ is an 
ordered tree, and called an unordered term tree if $T$ is an unordered tree. We 
call an element in $V_g$, $E_g$ and $H_g$ a vertex, an edge and a variable, respectively. 

Below we say a term tree or a tag tree pattern if we do not have to dis- 
tinguish between “ordered” and “unordered” ones. We assume that all variable 
labels in a term tree are different. $\Lambda$ and $X$ denote a set of edge labels and a set 
of variable labels, respectively, where $\Lambda \cap X = \emptyset$. We use a notation $[v, v']$ to 
represent a variable $\{v, v'\} \in H_g$ such that $v$ is the parent of $v'$. Then we call $v$ 
the parent port of $[v, v']$ and $v'$ the child port of $[v, v']$. 

Let $X^c$ be a distinguished subset of $X$. We call variable labels in $X^c$ con- 
tractible variable labels. A contractible variable label can be attached to a vari- 
able whose child port is a leaf. We call a variable with a contractible variable 
label a contractible variable; a tree with a singleton vertex may be substituted 
for it. We state the formal definition later. We call a variable which is 
not a contractible variable an uncontractible variable. In order to distinguish 
contractible variables from uncontractible variables, we denote by $[v, v']^c$ (resp. 
$[v, v']^u$) a contractible variable (resp. an uncontractible variable). 

For an ordered term tree $g$, the children of every internal vertex $u$ in $g$ have a 
total ordering. The ordering on the children of $u$ in an ordered 
term tree $g$ is denoted by $<_u^g$. Let $f = (V_f, E_f, H_f)$ and $g = (V_g, E_g, H_g)$ be two 
ordered (resp. unordered) term trees. We say that $f$ and $g$ are isomorphic if there 
is a bijection $\varphi$ from $V_f$ to $V_g$ which satisfies the following conditions (1)-(4) 
(resp. (1)-(3)): (1) The root of $f$ is mapped to the root of $g$ by $\varphi$. (2) $\{u, v\} \in E_f$ 
if and only if $\{\varphi(u), \varphi(v)\} \in E_g$ and the two edges have the same edge label. 
(3) $[u, v] \in H_f$ if and only if $[\varphi(u), \varphi(v)] \in H_g$; in particular, $[u, v]^c \in H_f$ if and 
only if $[\varphi(u), \varphi(v)]^c \in H_g$. (4) If $f$ and $g$ are ordered term trees, for any internal 
vertex $u$ in $f$ which has more than one child, and for any two children $u'$ and $u''$ 
of $u$, $u' <_u^f u''$ if and only if $\varphi(u') <_{\varphi(u)}^g \varphi(u'')$. 

Let $g$ be a term tree and $x$ be a variable label in $X$. Let $\sigma = [u, u']$ be a 
list of two vertices in $g$ where $u$ is the root of $g$ and $u'$ is a leaf of $g$. The form 
$x := [g, \sigma]$ is called a binding for $x$. If $x$ is a contractible variable label in $X^c$, 
$g$ may be a tree with a singleton vertex $u$ and thus $\sigma = [u, u]$. It is the only 
case that a tree with a singleton vertex is allowed for a binding. Let $f$ and $g$ 
be two ordered (resp. unordered) term trees. A new ordered (resp. unordered) 
term tree $f\{x := [g, \sigma]\}$ is obtained by applying the binding $x := [g, \sigma]$ to $f$ 
in the following way. Let $e = [v, v']$ be a variable in $f$ with the variable label 
$x$. Let $g'$ be one copy of $g$ and $w, w'$ the vertices of $g'$ corresponding to $u, u'$ of 
$g$, respectively. For the variable $e = [v, v']$, we attach $g'$ to $f$ by removing the 
variable $e$ from $H_f$ and by identifying the vertices $v, v'$ with the vertices $w, w'$ of 
$g'$, respectively. If $g$ is a tree with a singleton vertex, i.e., $u = u'$, then $v$ becomes 
identical to $v'$ after the binding. A substitution $\theta$ is a finite collection of bindings 
$\{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$, where the $x_i$'s are mutually distinct variable 
labels in $X$ and the $g_i$'s are term trees. The term tree $f\theta$, called the instance of $f$ by 
$\theta$, is obtained by applying all the bindings $x_i := [g_i, \sigma_i]$ on $f$ simultaneously. We 
define the root of the resulting term tree $f\theta$ as the root of $f$. Further we have 
to give a new total ordering $<_v^{f\theta}$ on every vertex $v$ of $f\theta$. These orderings are 
defined in a natural way. Consider the examples $g_1$, $g_2$, $t_2$ and $T_3$ in Fig. 1. Let 
$\theta = \{x := [g_1, [u_1, v_1]], y := [g_2, [u_2, v_2]]\}$ be a substitution. Then the instance 
$t_2\theta$ of the term tree $t_2$ by $\theta$ is the tree $T_3$. 
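
The binding operation can be made concrete with a small sketch. The Python below is an illustration only, under simplifying assumptions that are not in the text: trees are unordered, the substituted tree is ground (it has no variables of its own), and a term tree is stored as two lists of (parent, child, label) triples. The function name `apply_binding` and the scheme for naming copied vertices are hypothetical.

```python
def apply_binding(f_edges, f_vars, x, g_edges, g_root, g_leaf):
    """Return (edges, variables) of f{x := [g, [g_root, g_leaf]]}.

    f_edges, g_edges: lists of (parent, child, edge_label) triples.
    f_vars: list of (parent_port, child_port, variable_label) triples.
    Assumption of this sketch: g is ground and trees are unordered.
    """
    target = next((h for h in f_vars if h[2] == x), None)
    if target is None:
        # No variable labelled x occurs in f, so f is unchanged.
        return list(f_edges), list(f_vars)
    v, v_child, _ = target
    remaining = [h for h in f_vars if h[2] != x]

    if g_root == g_leaf:
        # Singleton tree: the two ports of the variable become one vertex.
        def rename(w):
            return v if w == v_child else w
        edges = [(rename(p), rename(c), lab) for p, c, lab in f_edges]
        variables = [(rename(p), rename(c), lab) for p, c, lab in remaining]
        return edges, variables

    # Copy g, identifying its root with v and its chosen leaf with v_child.
    image = {g_root: v, g_leaf: v_child}

    def copy_vertex(w):
        if w not in image:
            image[w] = ("g", x, w)  # fresh name for an inner vertex of the copy
        return image[w]

    copied = [(copy_vertex(p), copy_vertex(c), lab) for p, c, lab in g_edges]
    return list(f_edges) + copied, remaining
```

Calling `apply_binding` with a one-vertex tree (`g_root == g_leaf` and no edges) models what a contractible variable allows: the variable disappears and its child port is merged into its parent port.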

Definition 2. Let $\Lambda_{Tag}$ and $\Lambda_{KW}$ be two languages which consist of infinitely 
or finitely many words where $\Lambda_{Tag} \cap \Lambda_{KW} = \emptyset$. Let $\Lambda = \Lambda_{Tag} \cup \Lambda_{KW}$. We call 
a word in $\Lambda_{Tag}$ a tag and a word in $\Lambda_{KW}$ a keyword. An ordered (resp. 
unordered) tag tree pattern is an ordered (resp. unordered) term tree such 
that each edge label on it is a tag, a keyword, or the special symbol “?”. 
Let $\Lambda_?$ be a subset of $\Lambda$. The symbol “?” is a wildcard for any word in $\Lambda_?$. A tag 
tree pattern with no variable is called a ground tag tree pattern. 

For an edge $\{v, v'\}$ of a tag tree pattern and an edge $\{u, u'\}$ of a tree, we 
say that $\{v, v'\}$ matches $\{u, u'\}$ if the following conditions (1)-(3) hold: (1) If 
the edge label of $\{v, v'\}$ is a tag, then the edge label of $\{u, u'\}$ is the same tag 
or another tag which is considered to be identical with the tag on $\{v, v'\}$. (2) If 
the edge label of $\{v, v'\}$ is a keyword, then the edge label of $\{u, u'\}$ is a keyword 
and the label of $\{v, v'\}$ appears as a substring in the edge label of $\{u, u'\}$. (3) If 
the edge label of $\{v, v'\}$ is “?”, then the edge label of $\{u, u'\}$ is in $\Lambda_?$. A ground 
ordered (resp. unordered) tag tree pattern $\pi = (V_\pi, E_\pi, \emptyset)$ matches an ordered 
(resp. unordered) tree $T = (V_T, E_T)$ if there exists a bijection $\varphi$ from $V_\pi$ to $V_T$ 
which satisfies the following conditions (1)-(4) (resp. (1)-(3)): (1) The root of $\pi$ is 
mapped to the root of $T$ by $\varphi$. (2) $\{v, v'\} \in E_\pi$ if and only if $\{\varphi(v), \varphi(v')\} \in E_T$. 
(3) For all $\{v, v'\} \in E_\pi$, $\{v, v'\}$ matches $\{\varphi(v), \varphi(v')\}$. (4) If $\pi$ and $T$ are ordered, 
for any internal vertex $u$ in $\pi$ which has more than one child, and for any two 
children $u'$ and $u''$ of $u$, $u' <_u^\pi u''$ if and only if $\varphi(u') <_{\varphi(u)}^T \varphi(u'')$. A tag tree 
pattern $\pi$ matches a tree $T$ if there exists a substitution $\theta$ such that $\pi\theta$ is a 
ground tag tree pattern and $\pi\theta$ matches $T$. 
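
The three edge-matching conditions translate almost directly into code. The following Python sketch is illustrative only: the function name `edge_matches`, the string constants for the label kinds, and the `same_tag` hook (which stands in for "another tag which is considered to be identical") are assumptions of this example, not part of the paper.

```python
def edge_matches(pattern_label, pattern_kind, data_label, data_kind,
                 wildcard_words, same_tag=lambda a, b: a == b):
    """Decide whether a pattern edge matches a tree edge.

    pattern_kind is "tag", "keyword" or "wildcard"; data_kind is "tag" or
    "keyword". wildcard_words plays the role of the set of words that the
    symbol "?" may stand for. All names here are hypothetical.
    """
    if pattern_kind == "tag":
        # Condition (1): same tag, or a tag treated as identical.
        return data_kind == "tag" and same_tag(pattern_label, data_label)
    if pattern_kind == "keyword":
        # Condition (2): the pattern keyword is a substring of the data keyword.
        return data_kind == "keyword" and pattern_label in data_label
    if pattern_kind == "wildcard":
        # Condition (3): any word in the wildcard set is accepted.
        return data_label in wildcard_words
    raise ValueError("unknown pattern label kind: %r" % (pattern_kind,))
```

Passing a case-insensitive comparison as `same_tag` is one way to treat tags such as `Sec` and `SEC` as identical; which tags count as identical is a choice left to the application.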

$\mathcal{OT}_\Lambda$ (resp. $\mathcal{UT}_\Lambda$) denotes the set of all ordered (resp. unordered) trees whose 
edge labels are in $\Lambda$. $\mathcal{OTTP}_\Lambda$ (resp. $\mathcal{UTTP}_\Lambda$) denotes the set of all ordered 
(resp. unordered) tag tree patterns with contractible and uncontractible vari- 
ables whose tags and keywords are in $\Lambda$. For $\pi$ in $\mathcal{OTTP}_\Lambda$ (resp. $\mathcal{UTTP}_\Lambda$), the 
language $L_\Lambda(\pi)$ is defined as $\{$a tree $T$ in $\mathcal{OT}_\Lambda$ (resp. $\mathcal{UT}_\Lambda$) $\mid$ $\pi$ matches $T\}$. 



3 Data Mining Problems 

Data mining setting: A set of ordered (resp. unordered) semistructured data 
$\mathcal{D} = \{T_1, T_2, \ldots, T_m\}$ is a set of ordered (resp. unordered) trees. Let $\Lambda_{\mathcal{D}}$ be the 
set of all edge labels of trees in $\mathcal{D}$. The matching count of an ordered (resp. 
unordered) tag tree pattern $\pi$ w.r.t. $\mathcal{D}$, denoted by $match_{\mathcal{D}}(\pi)$, is the number 
of ordered (resp. unordered) trees $T_i \in \mathcal{D}$ ($1 \le i \le m$) such that $\pi$ matches $T_i$. 
Then the frequency of $\pi$ w.r.t. $\mathcal{D}$ is defined by $supp_{\mathcal{D}}(\pi) = match_{\mathcal{D}}(\pi)/m$. Let $\sigma$ 
be a real number where $0 < \sigma \le 1$. A tag tree pattern $\pi$ is $\sigma$-frequent w.r.t. $\mathcal{D}$ if 
$supp_{\mathcal{D}}(\pi) \ge \sigma$. Let $\Pi$ denote $\mathcal{OTTP}_\Lambda$ or $\mathcal{UTTP}_\Lambda$, and let $\Lambda' \subseteq \Lambda_{Tag} \cup \Lambda_{KW} \cup \{?\}$. We 
denote by $\Pi(\Lambda')$ the set of all tag tree patterns $\pi \in \Pi$ such that all edge labels of 
$\pi$ are in $\Lambda'$. Let $Tag$ be a finite subset of $\Lambda_{Tag}$ and $KW$ a finite subset of $\Lambda_{KW}$. 
An ordered (resp. unordered) tag tree pattern $\pi$ in $\mathcal{OTTP}_\Lambda(Tag \cup KW \cup \{?\})$ 
(resp. $\mathcal{UTTP}_\Lambda(Tag \cup KW \cup \{?\})$) is maximally $\sigma$-frequent w.r.t. $\mathcal{D}$ if (1) $\pi$ is 
$\sigma$-frequent, and (2) if $L_\Lambda(\pi') \subsetneq L_\Lambda(\pi)$ then $\pi'$ is not $\sigma$-frequent for any tag tree 
pattern $\pi'$ in $\mathcal{OTTP}_\Lambda(Tag \cup KW \cup \{?\})$ (resp. $\mathcal{UTTP}_\Lambda(Tag \cup KW \cup \{?\})$). 
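
As a small illustration of the matching count and frequency, the sketch below computes them for any matcher supplied by the caller. It assumes that a pattern counts as frequent when its frequency is at least the threshold. The lookup table `EXPLAINS` is a hypothetical stand-in for a real matching algorithm; it only records the relationship stated in the Introduction, namely that $t_2$ explains $T_2$ and $T_3$ but not $T_1$.

```python
from fractions import Fraction

def match_count(pattern, trees, matches):
    """Number of trees in the data set that the pattern matches."""
    return sum(1 for tree in trees if matches(pattern, tree))

def support(pattern, trees, matches):
    """Frequency of the pattern: matching count divided by the number of trees."""
    return Fraction(match_count(pattern, trees, matches), len(trees))

def is_frequent(pattern, trees, matches, sigma):
    """Assumption of this sketch: frequent means support >= sigma."""
    return support(pattern, trees, matches) >= sigma

# Hypothetical stand-in for a matching algorithm: a fixed table of
# (pattern, tree) pairs taken from the example in the Introduction.
EXPLAINS = {("t2", "T2"), ("t2", "T3")}

def toy_matches(pattern, tree):
    return (pattern, tree) in EXPLAINS
```

With the three trees of Fig. 1 as data, `support("t2", ["T1", "T2", "T3"], toy_matches)` is `Fraction(2, 3)`, which agrees with the frequency of $t_2$ implied by the Introduction.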

All Maximally Frequent Ordered Tag Tree Patterns (MFOTTP) 

Input: A set of ordered semistructured data $\mathcal{D}$, a threshold $0 < \sigma \le 1$, and two 
finite sets of edge labels $Tag$ and $KW$. 

Assumption: $\Lambda_{\mathcal{D}} \subseteq \Lambda_? \subset \Lambda$. 

Problem: Generate all maximally $\sigma$-frequent ordered tag tree patterns w.r.t. $\mathcal{D}$ 
in $\mathcal{OTTP}_\Lambda(Tag \cup KW \cup \{?\})$. 

All Maximally Frequent Unordered Tag Tree Patterns (MFUTTP) 

Input: A set of unordered semistructured data $\mathcal{D}$, a threshold $0 < \sigma \le 1$, and 
two finite sets of edge labels $Tag$ and $KW$. 

Assumption: $\Lambda_{\mathcal{D}} \subseteq \Lambda_? \subseteq \Lambda$, and the cardinality of both $\Lambda - \Lambda_?$ and $\Lambda_? - \Lambda_{\mathcal{D}}$ 
is infinite. 

Problem: Generate all maximally $\sigma$-frequent unordered tag tree patterns w.r.t. 
$\mathcal{D}$ in $\mathcal{UTTP}_\Lambda(Tag \cup KW \cup \{?\})$. 

4 Generating All Maximally Frequent Ordered Tag Tree 
Patterns 

We give an algorithm Gen-Mfottp which generates all maximally $\sigma$-frequent 
ordered tag tree patterns. Let $\mathcal{D}$ be an input set of ordered semistructured data. 
In the following procedure Sub-Mfottp, we use the algorithm for generating all 
ordered trees [2]. A tag tree pattern with only uncontractible variables is a tag 
tree pattern consisting of only vertices and uncontractible variables. We regard 
an ordered tag tree pattern with only uncontractible variables as an ordered 
tree with the same tree structure. By using the same parent-child relation as 
in [2], we can enumerate without any duplicate all ordered tag tree patterns 
with only uncontractible variables by depth first search from general to 
specific with backtracking. Although the semantics of matching of tree structured 
patterns and tree structured data is different from that in [2], a parent pattern 
is more general than its child patterns in the generating process of ordered tag 
tree patterns with only uncontractible variables. 
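
The search strategy described above (depth first, from general to specific, abandoning a branch as soon as a pattern is infrequent) can be sketched generically. In the Python below the callables `children` and `is_frequent` are hypothetical stand-ins supplied by the caller, and the toy search space uses strings ordered by prefix instead of tag tree patterns, purely to keep the example small and runnable. The pruning is sound only under the property stated in the text: a parent is more general than each of its children.

```python
def enumerate_frequent(root, children, is_frequent):
    """Depth-first enumeration with pruning of infrequent branches."""
    found = []

    def visit(pattern):
        if not is_frequent(pattern):
            # Every descendant is more specific, so none can be frequent.
            return
        found.append(pattern)
        for child in children(pattern):
            visit(child)

    visit(root)
    return found

# Toy search space (an assumption of this sketch, not the paper's patterns):
# a "pattern" is a string, it "matches" a data string when it is a prefix
# of it, and its children extend it by one character.
DATA = ["ab", "abb", "ba"]

def toy_children(pattern):
    return [pattern + ch for ch in "ab"] if len(pattern) < 3 else []

def toy_is_frequent(pattern, min_count=2):
    return sum(1 for s in DATA if s.startswith(pattern)) >= min_count
```

Here `enumerate_frequent("", toy_children, toy_is_frequent)` visits the branch below `"b"` only once, at `"b"` itself, because `"b"` is already infrequent.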

Algorithm Gen-Mfottp; 
begin 
  $\Pi(\sigma) := \emptyset$; $\pi := (\{u, u'\}, \emptyset, \{[u, u']\})$; Sub-Mfottp($\pi$); return $\Pi(\sigma)$; 
end. 

Procedure Sub-Mfottp($\pi$); 
begin 
  if $\pi$ is not $\sigma$-frequent w.r.t. $\mathcal{D}$ then return else Basic-Mfottp($\pi$); 
  foreach child tag tree pattern $\pi'$ of $\pi$ do Sub-Mfottp($\pi'$); 
end; 

Procedure Basic-Mfottp($\pi$); 
begin 
  Step 1. Generate $\sigma$-frequent tag tree patterns: 
    Let $H_\pi = \{h_1, \ldots, h_k\}$ be the variable set of $\pi$. We perform procedure 
    Substitution-Ot($\pi$, $h_1$, $k$). 
  Step 2. Eliminate redundancy: 
    For each $\pi \in \Pi(\sigma)$, if there exists a pair of contractible variables $[u, v]^c$ and 
    $[u, v']^c$ such that $v'$ is the immediately right sibling of $v$, then we remove $\pi$ 
    from $\Pi(\sigma)$. 
  Step 3. Maximality test 1: 
    For each $\pi \in \Pi(\sigma)$, if there exists an uncontractible (resp. contractible) vari- 
    able $x$ in $\pi$ such that $\pi\theta_X(x)$ is $\sigma$-frequent w.r.t. $\mathcal{D}$ for any $X \in \{A, B, C, D\}$ 
    (resp. $X \in \{E, F\}$), then $\pi$ is not maximally $\sigma$-frequent w.r.t. $\mathcal{D}$, and we 
    remove $\pi$ from $\Pi(\sigma)$. 
    [Figure: the trees $T_A, \ldots, T_F$ used in the substitutions $\theta_X(x) = \{x := [T_X, [R_X, L_X]]\}$, $X \in \{A, B, C, D, E, F\}$.] 
  Step 4. Maximality test 2: 
    If there exists an edge with “?” in $\pi$ such that a tag tree pattern obtained 
    from $\pi$ by replacing the edge with an edge which has a label in $Tag \cup KW$ 
    is $\sigma$-frequent w.r.t. $\mathcal{D}$, then $\pi$ is not maximally $\sigma$-frequent w.r.t. $\mathcal{D}$, and we 
    remove $\pi$ from $\Pi(\sigma)$. 
end; 



Procedure Substitution-Ot($\pi$, $h_i$, $k$); 
begin 
  if $i = k + 1$ then begin $\Pi(\sigma) := \Pi(\sigma) \cup \{\pi\}$; return end; 
  if the child port of $h_i$ is not a leaf then Substitution-Ot($\pi$, $h_{i+1}$, $k$); 
  Variable-Replacing-Ot($\pi$, $h_i$, $k$) (Fig. 2); 
  return; 
end; 



For ordered (resp. unordered) tag tree patterns $g = (V, E, H)$ and 
$g' = (V', E', H')$, we say that $g'$ is an ordered (resp. unordered) tag subtree pattern of 
$g$ if $V' \subseteq V$, $E' \subseteq E$, and $H' \subseteq H$. For an ordered (resp. unordered) tag tree 
pattern $\pi'_i$ in Fig. 3 (resp. Fig. 5), an occurrence of $\pi'_i$ in $g$ is an ordered (resp. 
unordered) tag subtree pattern of $g$ which is isomorphic to $\pi'_i$. The label $k$ 
(resp. $\geq k$) in a box near $u$ shows that the number of children of $u$, which connect 
to $u$ with edges or uncontractible variables, is equal to $k$ (resp. is more than or 
equal to $k$) (Fig. 3, 5). 



Lemma 1. Let $\pi_i$ and $\pi'_i$ ($1 \le i \le 4$) be tag tree patterns described in Fig. 3. 
Let $\pi'$ be an ordered tag tree pattern which has at least one occurrence of $\pi'_i$ 
($1 \le i \le 4$). For one of the occurrences of $\pi'_i$, we make a new ordered tag tree 
pattern $\pi$ by replacing the occurrence of $\pi'_i$ with $\pi_i$. Then $L_\Lambda(\pi) = L_\Lambda(\pi')$. 

An ordered tag tree pattern $\pi$ is said to be a canonical ordered tag tree 
pattern if $\pi$ has no occurrence of $\pi'_i$ ($1 \le i \le 4$) (Fig. 3). Any ordered tag 
tree pattern $\pi$ is transformed into the canonical ordered tag tree pattern by 
replacing all occurrences of $\pi'_i$ with $\pi_i$ ($1 \le i \le 4$) repeatedly. We denote by 
$c(\pi)$ the canonical ordered tag tree pattern transformed from $\pi$. We note that 
$L_\Lambda(c(\pi)) = L_\Lambda(\pi)$. 



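Lemma 2. Let $\Lambda = \Lambda_{Tag} \cup \Lambda_{KW}$. Let $\pi = (V_\pi, E_\pi, H_\pi)$ be an input tag tree 
pattern which is decided to be maximally $\sigma$-frequent w.r.t. $\mathcal{D}$ by Gen-Mfottp. 
If there is a tag tree pattern $\pi' = (V_{\pi'}, E_{\pi'}, H_{\pi'})$ which is $\sigma$-frequent w.r.t. $\mathcal{D}$ 
and $L_\Lambda(\pi') \subseteq L_\Lambda(\pi)$, then $c(\pi) = c(\pi')$. 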



Theorem 1. Algorithm Gen-Mfottp generates all maximally $\sigma$-frequent 
ordered tag tree patterns in canonical form. 

/* Let $\theta^{?}_i(x) = \{x := [t^{?}_i, [R_i, L_i]]\}$ ($1 \le i \le 8$), $\theta^{A}_j(x) = \{x := [t^{A}_j, [R_j, L_j]]\}$ ($1 \le j \le 8$, 
$A \in Tag \cup KW$). The tag tree pattern at the end of each arrow is more specific than 
that at the origin of the arrow. For example, if the pattern at the origin of an arrow is 
not $\sigma$-frequent w.r.t. $\mathcal{D}$, then no pattern reachable from it along arrows is $\sigma$-frequent w.r.t. $\mathcal{D}$. */ 
Procedure Variable-Replacing-Ot($\pi$, $h_i$, $k$); 
begin 
  Let $x$ be the variable label of $h_i$; 

If the child port of hi is a leaf then Q := {Ti9i{x)} else Q := {•n9^{x)}\ 

  while $Q \neq \emptyset$ do begin 
    Choose one tag tree pattern $\pi\theta^{a}_{l}(x)$ ($a \in Tag \cup KW \cup \{?\}$) from $Q$; 
    $Q := Q - \{\pi\theta^{a}_{l}(x)\}$; 
    if $\pi\theta^{a}_{l}(x)$ is $\sigma$-frequent w.r.t. $\mathcal{D}$ then begin 
      Substitution-Ot($\pi\theta^{a}_{l}(x)$, $h_{i+1}$, $k$); 
      Add all tag tree patterns $\pi\theta^{b}_{j}(x)$ ($b \in Tag \cup KW \cup \{?\}$) to $Q$ 
      s.t. $t^{b}_{j}$ are at the ends of the arrows from $t^{a}_{l}$ described in the upper figure. 
    end 
  end 
end; 

Fig. 2. Procedure Variable-Replacing-Ot($\pi$, $h_i$, $k$). 
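
The loop in Fig. 2 is a worklist refinement: a candidate is taken from $Q$, tested for frequency, and only if it is frequent are the candidates at the ends of its outgoing arrows added to $Q$. A minimal Python sketch of that control flow follows. The names `refine`, `successors`, `is_frequent` and `on_frequent` are hypothetical, the arrow diagram is replaced by a small made-up graph over integers, and the `seen` set that avoids re-queuing a candidate is an addition of this sketch rather than something stated in the figure.

```python
from collections import deque

def refine(start, successors, is_frequent, on_frequent):
    """Worklist refinement in the style of the loop in Fig. 2."""
    queue = deque([start])
    seen = {start}  # added here to avoid queuing the same candidate twice
    while queue:
        candidate = queue.popleft()
        if is_frequent(candidate):
            # Corresponds to the recursive call on the next variable.
            on_frequent(candidate)
            # Only refinements of a frequent candidate are worth testing.
            for nxt in successors(candidate):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

# Made-up arrow diagram: an arrow a -> b means b is more specific than a.
ARROWS = {1: [2, 3], 2: [4], 3: [4], 4: []}
FREQUENT = {1, 2}
```

Running `refine(1, ARROWS.__getitem__, FREQUENT.__contains__, visited.append)` with `visited = []` reports candidates 1 and 2; candidate 4 is still tested once because it is reachable from the frequent candidate 2.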

Fig. 3. For $i = 1, 2, 3, 4$, $L_\Lambda(\pi_i) = L_\Lambda(\pi'_i)$. An arrow shows that the right vertex of it 
is the immediately right sibling of the left vertex. 



5 Generating All Maximally Frequent Unordered Tag 
Tree Patterns 

We give an algorithm Gen-Mfuttp which generates all maximally $\sigma$-frequent 
unordered tag tree patterns. Let $\mathcal{D}$ be an input set of unordered semistructured 
data. The descriptions of Gen-Mfuttp, Sub-Mfuttp and Substitution-Ut 
are obtained from Gen-Mfottp, Sub-Mfottp and Substitution-Ot, respec- 
tively, by replacing “Mfottp” and “Ot” with “Mfuttp” and “Ut”, respec- 
tively. In Sub-Mfuttp, we use the algorithm for generating all unordered trees 
and the parent-child relation in [3,8] in order to implement the depth first search 
from general to specific with backtracking. 

Procedure Basic-Mfuttp($\pi$); 
begin 
  Step 1. Generate $\sigma$-frequent tag tree patterns: 
    Let $H_\pi = \{h_1, \ldots, h_k\}$ be the variable set of $\pi$. We perform procedure 
    Substitution-Ut($\pi$, $h_1$, $k$). 
  Step 2. Eliminate redundancy: 
    For each $\pi \in \Pi(\sigma)$, if there exists a vertex $u$ that is the parent port of two 
    or more contractible variables, then we remove $\pi$ from $\Pi(\sigma)$. 
  Step 3. Maximality test 1: 
    For each $\pi \in \Pi(\sigma)$, if there exists an uncontractible (resp. contractible) 
    variable $x$ in $\pi$ such that $\pi\theta_X(x)$ is $\sigma$-frequent w.r.t. $\mathcal{D}$ for any $X \in \{A, B, C\}$ 
    (resp. $X \in \{D, E\}$), then $\pi$ is not maximally $\sigma$-frequent w.r.t. $\mathcal{D}$, and we 
    remove $\pi$ from $\Pi(\sigma)$. 
    [Figure: the trees $T_A, \ldots, T_E$ used in the substitutions $\theta_X(x) = \{x := [T_X, [R_X, L_X]]\}$, $X \in \{A, B, C, D, E\}$.] 
  Step 4. Maximality test 2: 
    If there exists an edge with “?” in $\pi$ such that a tag tree pattern obtained 
    from $\pi$ by replacing the edge with an edge which has a label in $Tag \cup KW$ 
    is $\sigma$-frequent w.r.t. $\mathcal{D}$, then $\pi$ is not maximally $\sigma$-frequent w.r.t. $\mathcal{D}$, and we 
    remove $\pi$ from $\Pi(\sigma)$. 

/* Let $\theta^{?}_i(x) = \{x := [t^{?}_i, [R_i, L_i]]\}$ ($1 \le i \le 4$), $\theta^{A}_j(x) = \{x := [t^{A}_j, [R_j, L_j]]\}$ ($1 \le j \le 4$, 
$A \in Tag \cup KW$). */ 
Procedure Variable-Replacing-Ut($\pi$, $h_i$, $k$); 
begin 
  Let $x$ be the variable label of $h_i$; 
  if the child port of $h_i$ is a leaf then $Q := \{\pi\theta^{?}_{1}(x)\}$ else $Q := \{\pi\theta^{?}_{3}(x)\}$; 
  while $Q \neq \emptyset$ do begin 
    Choose one tag tree pattern $\pi\theta^{a}_{l}(x)$ ($a \in Tag \cup KW \cup \{?\}$) from $Q$; 
    $Q := Q - \{\pi\theta^{a}_{l}(x)\}$; 
    if $\pi\theta^{a}_{l}(x)$ is $\sigma$-frequent w.r.t. $\mathcal{D}$ then begin 
      Substitution-Ut($\pi\theta^{a}_{l}(x)$, $h_{i+1}$, $k$); 
      Add all tag tree patterns $\pi\theta^{b}_{j}(x)$ ($b \in Tag \cup KW \cup \{?\}$) to $Q$ 
      s.t. $t^{b}_{j}$ are at the ends of the arrows from $t^{a}_{l}$ described in the upper figure. 
    end 
  end 
end; 

Fig. 4. Procedure Variable-Replacing-Ut($\pi$, $h_i$, $k$). 



end; 

Lemma 3. Let $\pi_i$ and $\pi'_i$ ($1 \le i \le 3$) be unordered tag tree patterns in Fig. 5. 
Let $\pi'$ be an unordered tag tree pattern which has at least one occurrence of $\pi'_i$ 
($1 \le i \le 3$). For one of the occurrences of $\pi'_i$, we make a new unordered tag tree 
pattern $\pi$ by replacing the occurrence of $\pi'_i$ with $\pi_i$. Then $L_\Lambda(\pi) = L_\Lambda(\pi')$. 

An unordered tag tree pattern $\pi$ is said to be a canonical unordered tag tree 
pattern if $\pi$ has no occurrence of $\pi'_i$ ($1 \le i \le 3$) (Fig. 5). Any unordered tag 
tree pattern $\pi$ is transformed into the canonical unordered tag tree pattern by 
replacing all occurrences of $\pi'_i$ with $\pi_i$ ($1 \le i \le 3$) repeatedly. We denote by $c(\pi)$ 
the canonical unordered tag tree pattern transformed from $\pi$. 

Fig. 5. For $i = 1, 2, 3$, $L_\Lambda(\pi_i) = L_\Lambda(\pi'_i)$. Let $u$ be a vertex of a pattern and $v_1, \ldots, v_k$ the children 
of $u$. We suppose that at least one child among $v_1, \ldots, v_k$ is connected to $u$ by a 
contractible variable or an uncontractible variable. 




Fig. 6. Experimental results (panels Exp. 1 and Exp. 2). 



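Lemma 4. Let $\Lambda = \Lambda_{Tag} \cup \Lambda_{KW}$. Let $\pi = (V_\pi, E_\pi, H_\pi)$ be an input tag tree 
pattern which is decided to be maximally $\sigma$-frequent w.r.t. $\mathcal{D}$ by Gen-Mfuttp. 
If there is a tag tree pattern $\pi' = (V_{\pi'}, E_{\pi'}, H_{\pi'})$ which is $\sigma$-frequent w.r.t. $\mathcal{D}$ 
and $L_\Lambda(\pi') \subseteq L_\Lambda(\pi)$, then $c(\pi) = c(\pi')$. 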



Theorem 2. Algorithm Gen-Mfuttp generates all maximally $\sigma$-frequent 
unordered tag tree patterns in canonical form. 

6 Implementation and Experimental Results 

In order to evaluate the performance of the process of searching for $\sigma$-frequent 
ordered (resp. unordered) tag tree patterns with only uncontractible variables 
in our algorithms, we ran two types of experiments that generate all such 
patterns with the previous implementation (“previous impl.”) and the present 
implementation (“present impl.”). The previous implementation [7] (resp. [5]) cannot 
prune the search space in the process. The present implementation prunes the 
search space by modifying Gen-Mfottp (resp. Gen-Mfuttp). The implementation 
is in GCL 2.2 and runs on a Sun Ultra-10 workstation with a 333 MHz clock. The 
sample file is converted from a sample XML file about garment sales data. The 
sample file consists of 172 tree structured data. The maximum number of 
vertices over all trees in the file is 11. We can set the maximum number (“max 
of vertices in ordered (resp. unordered) TTPs”) of vertices of ordered (resp. 
unordered) tag tree patterns in the hypothesis space. Exp. 1 (resp. Exp. 2) gives 
the run time (sec) consumed by the two implementations in the case of ordered 
(resp. unordered) tag tree patterns for the specified minimum frequency 0.1 
and varied maximum numbers of vertices of ordered (resp. unordered) tag tree 
patterns in the hypothesis spaces. These experiments show that the pruning of 
the search space in the above process is effective. 

7 Conclusions 

We have studied knowledge discovery from semistructured Web documents such 
as HTML/XML files. We have presented algorithms for generating all maximally 
frequent ordered and unordered tag tree patterns with contractible variables. 
This work is partly supported by Grant-in-Aid for Scientific Research (C) 
No. 13680459 from Japan Society for the Promotion of Science and Grant for 
Special Academic Research No. 2101 from Hiroshima City University. 

Abstract. As the Web has become a major channel for floods of information, 
users have difficulty identifying what they really want among the huge 
amounts of low-quality information returned by Web search engines when 
utilizing the Web's low-cost information. This is because most users can only 
give inadequate (or incomplete) expressions of their requirements 
when querying the Web. In this paper, a heuristic model is proposed for 
tackling the inadequate query problem. Our approach is based on the potentially 
useful relationships among terms, called term association rules, in a text corpus. 
For identifying quality information, a constraint is designed for capturing the 
goodness of queries. The heuristic information in our model assists users in 
expressing their desired queries. 



1 Introduction 

User queries to the Web or other information systems are commonly described by 
using one or more terms as keywords to retrieve information. Some queries might be 
appropriately given by experienced and knowledgeable users, while others might not 
be good enough to ensure that the returned results are what the users want. Some 
users consider Boolean logic statements too complicated to use. Usually, 
users are not experts in the area in which the information is searched. Therefore, they 
might lack the domain-specific vocabulary and the authors' preferred terms 
used to build the information system. They consequently start searching with generic 
words to describe the information to be searched for. Sometimes, users are even 
unsure of what exactly they need from the retrieval. All of these reasons often lead 
to the use of incomplete and inaccurate terms for searching. Thus, an information 
retrieval system should provide tools that automatically help users develop 
search descriptions that match both the need of the user and the writing style of the 
authors. 

One of the approaches to providing this service is the automatic expansion of the queries 
with some additional terms [1, 2]. These expanded terms for a given query should be 
semantically or statistically associated with terms in the original query. An expansion 
method may use local or global knowledge, i.e. look at the answers to a query or 
at the entire corpus. An expansion method may be user-interactive or fully 
automatic. 



H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 145-154, 2004. 

Global methods use the entire corpus or external knowledge to expand the query. The 
knowledge used comes in the form of a thesaurus. A thesaurus is a data structure 
recording associations between terms. Man-made thesauri such as WordNet [13], 
whether they are general purpose or domain specific, record linguistic or conceptual 
associations between the terms as identified by human experts. The expansion is 
automatic but applies to both the query and the documents. 

Some researchers have recently attempted to automatically mine the associations for 
query expansion from the corpus. Most approaches are based on clustering of terms in 
the document space. Intuitively, clustering captures synonymous associations. In [Lin 
1998] the authors present a global analysis approach based on association rules 
mining. However, their objective is not to find associations for query expansion but 
rather to construct a classification of the documents. Moreover, techniques of 
association rule mining [9, 10] are frequently used for text mining [3, 5, 6, 8] and 
global query expansion [4, 7]. 

Local methods are also known as relevance feedback methods. The local set of 
documents is the set retrieved with an initial unexpanded query. Local methods use 
the local set to discover the additional candidate terms to be added to the query. In 
practice the local set is kept arbitrarily small compared to the total size of the corpus. The 
user can indicate manually which documents in the local set are relevant. This step is 
called relevance feedback. The frequent words in the relevant documents can be used 
to expand the initial query and/or re-weight the query terms by means of a formula or 
algorithm, for example the Rocchio algorithm. The quality of relevance feedback is 
the key point of local methods. But not all users like the fussy interactive procedure, 
and the current algorithms are relatively complicated for common users. The pseudo 
relevance feedback method simply treats the top-ranked documents as relevant, 
but it does not guarantee the quality of the relevance feedback. 
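
For readers unfamiliar with the Rocchio algorithm mentioned above, the following Python sketch shows its usual textbook form: the new query vector is a weighted sum of the original query, the centroid of the relevant documents, and the negated centroid of the non-relevant documents. This is background illustration, not the method proposed in this paper. The coefficient values are illustrative defaults chosen for this example, and dropping terms whose weight is not positive is an assumption of the sketch.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-weight a query from relevance feedback.

    query and each document are dicts mapping a term to a weight.
    alpha, beta, gamma are illustrative values, not taken from the paper.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    def centroid(docs, term):
        # Mean weight of the term over the documents (0 for an empty set).
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(nonrelevant, term))
        if weight > 0:  # assumption: terms with non-positive weight are dropped
            new_query[term] = weight
    return new_query
```

With pseudo relevance feedback, `relevant` would simply be the top-ranked documents of the initial retrieval and `nonrelevant` would be left empty.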

In this paper, an association analysis approach is proposed for combining the global 
and local information by maintaining a dynamic association rule set for query 
expansion. It aims at a better understanding of user queries so as to automatically generate 
query constructions. In Section 2, the term association analysis and the rule 
maintenance model are introduced. In Section 3, the structure of our system and 
the different heuristic information used in the system are described. In Section 4, the performance 
of our method is evaluated based on our experiments. 



2 Related Works 

2.1 Term Association Rules 

Constructing queries with term association rules means adding to a query words taken from 
rules that have query terms on one side and that meet the required confidence and support. 
Generally, term association rules are of the form: 

t1 → t2, support = s, confidence = c 

s = s(t1, t2), c = s(t1, t2) / s(t1) 

where t1 and t2 are terms, s(t1, t2) is the support of t1 and t2 together, and s(t1) is the support of t1. 
A rule with a high confidence indicates that term t2 often occurs in a document where 
term t1 occurs. A rule with high support indicates that many examples were found to 
suggest the rule. 
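
The two measures can be computed directly from a collection of documents. The sketch below is a minimal illustration that assumes each document is represented as a set of terms and that support is the fraction of documents containing all the given terms; the function names and the four toy documents are made up for this example.

```python
def support(docs, *terms):
    """Fraction of documents that contain every one of the given terms."""
    return sum(1 for d in docs if all(t in d for t in terms)) / len(docs)

def confidence(docs, t1, t2):
    """Confidence of the rule t1 -> t2: s(t1, t2) / s(t1)."""
    s1 = support(docs, t1)
    return support(docs, t1, t2) / s1 if s1 else 0.0

# Toy corpus (hypothetical): each document is a set of terms.
DOCS = [
    {"nuclear", "reactor"},
    {"nuclear", "soviet"},
    {"reactor", "nuclear", "plant"},
    {"grain", "wheat"},
]
```

On this toy corpus the rule reactor → nuclear has confidence 1 because every document containing “reactor” also contains “nuclear”, while the reverse rule has a lower confidence because “nuclear” also occurs without “reactor”.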

The scope of our investigation is defined by the variation of the range of the two 
parameters and by the options in using rules of one or more of the forms below, given 
for the query term “nuclear”: 

“nuclear” → t, s, c; 
t → “nuclear”, s, c; 
t ↔ “nuclear”, i.e. t → “nuclear” and “nuclear” → t, s, c1, c2. 

Of course such rules exist as soon as a term t appears in a document where the term 
“nuclear” appears. We use the confidence and the support of the rules, which indicate 
their quality, to select only some of them. We take the word “nuclear” in Topic 202 as 
an example to see how it is used. 

Example 1: 

Nuclear → soviet (supp = 0.016, conf = 0.4872) 
Nuclear → U.S. (supp = 0.0187, conf = 0.5688) 
Plutonium → nuclear (supp = 0.0017, conf = 0.8993) 
Reactor → nuclear (supp = 0.0039, conf = 0.9711) 
Weapon ↔ nuclear (supp = 0.0171, conf1 = 0.2825, conf2 = 0.5194) 

The three kinds of rules are illustrated above. We only consider type 1 rules in this paper because 
the user confirms the condition words of a query in the interactive procedure. 
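
To show how type 1 rules could drive an expansion, the sketch below filters the unidirectional rules of Example 1 by minimum support and confidence and appends the right-hand terms to the query. The support and confidence figures are those of Example 1; the function name `expand`, the tuple layout, and the threshold values used in the usage note are hypothetical choices for illustration and are not taken from the paper.

```python
# (left-hand term, right-hand term, support, confidence) from Example 1.
RULES = [
    ("nuclear", "soviet", 0.016, 0.4872),
    ("nuclear", "U.S.", 0.0187, 0.5688),
    ("plutonium", "nuclear", 0.0017, 0.8993),
    ("reactor", "nuclear", 0.0039, 0.9711),
]

def expand(query_terms, rules, min_supp, min_conf):
    """Append right-hand terms of rules whose left-hand term is in the query."""
    added = []
    for lhs, rhs, supp, conf in rules:
        qualifies = supp >= min_supp and conf >= min_conf
        if lhs in query_terms and qualifies and rhs not in query_terms and rhs not in added:
            added.append(rhs)
    return list(query_terms) + added
```

With the hypothetical thresholds `min_supp=0.01` and `min_conf=0.5`, `expand(["nuclear"], RULES, 0.01, 0.5)` adds only “U.S.”, since the rule to “soviet” falls just below the confidence threshold; lowering `min_conf` to 0.4 adds both terms.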

2.2 Is There Natural Semantics behind the Rules? 

A high confidence rule of the form t1 → t2 indicates that (often) t2 appears in a 
document if t1 appears. This suggests hyper/hyponym or holo/meronym types of 
relations between the terms: t2 is equally or more general than t1, and conversely. 
Examples mined from our corpus are “Kohl” is a “Chancelor”, and “soybean”, 
“corn”, and “wheat” are kinds of “grain”. The relations found characterize what we 
could call a contextual holonymy: if t1 → t2, then t1 is part of the vocabulary in the 
topical context suggested by the concept denoted by t2. For example we found such 
associations between the names of “Mandela” and “De Klerk” with the term 
“Apartheid”. 

A rule t1 ↔ t2, i.e. t1 → t2 and t2 → t1 with high and similar respective confidence 
as well as a high support, indicates that t1 and t2 tend to appear together. This suggests 
a contextual synonymy. An example is the mined association between “conviction” 
and “sentence”. 

Many associations were also mined between nouns and adjectives not handled by the 
stemming algorithm such as “Jew” and “Jewish”, “Japan” and “Japanese”. There are 
also many such associations linking the first and last names of personalities: 
“Saddam” and “Hussein”, “Mikail” and “Gorbachev” (provided the first and last 
names are not ambiguous in the corpus). Of course the synonymy is not proper. The 
association indicates the similarity of the terms in the topical context such as the 
association found between the two terms “crisis” and “Gulf” in the 1990 corpus. 

Although we have not conducted a formal evaluation of the results obtained, we 
notice that our subjective evaluation leads to conclusions similar to those presented 
and substantiated in [4]: it is possible to obtain a topically meaningful thesaurus from 
the statistical analysis of a corpus. In our case the meaning of the association 
discovered can be related to the form and parameters of the association rules. A 
formal evaluation of the semantic quality of the rules could be conducted, for instance, 
with a protocol that asks users familiar with the corpus or its subject matter 
to assess the rules. 



2.3 Maintenance of Association Rules by Weighting 

In many applications, the databases are dynamic, i.e., transactions are continuously
being added. In particular, we observe that recently added transactions can be more
'interesting' than those inserted long ago, in the sense that they reflect more closely the
current buying trends of customers.

For example, a toy department store will be flooded with purchases of Tarzan and his
friends in the period just before, during and immediately after the movie Tarzan is
shown in cinemas. The department store will want to capture the sales of these toys
accurately to facilitate ordering and sales. On the other hand, some previously popular
items in the database may become less popular, with very few transactions in the recently
added dataset. For example, if the movie Mulan was shown before Tarzan, its sales are
expected to drop once its run has ended and Tarzan is being shown.

Our query construction model is based on association rules, so the associations found in
the feedback must be incorporated into the rule set. We give a brief introduction to the
maintenance model, which is based on our previous work. Figure 1 illustrates the rule
maintenance model; more details can be found in [10].
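
As a rough illustration of the weighting idea, the sketch below merges an existing rule base with the rules mined from a new increment by taking a weighted average of their supports, so that recent data can count more. This is only one simple scheme under stated assumptions: the function names, the single weight `w_new`, the threshold `min_supp` and the treatment of a missing rule as support 0 are all hypothetical, and the actual model is the one described in [10].

```python
from typing import Dict, Tuple

Rule = Tuple[str, str]  # (antecedent, consequent)


def weighted_support(supp_old: float, supp_new: float, w_new: float) -> float:
    """Weighted average of old and new support; w_new is the weight of the increment."""
    if not 0.0 <= w_new <= 1.0:
        raise ValueError("w_new must lie in [0, 1]")
    return (1.0 - w_new) * supp_old + w_new * supp_new


def maintain(rule_base: Dict[Rule, float],
             increment_rules: Dict[Rule, float],
             w_new: float = 0.6,
             min_supp: float = 0.1) -> Dict[Rule, float]:
    """Merge two rule bases; a rule absent from one side is treated as
    having support 0 there (an assumption made for this sketch)."""
    merged = {}
    for rule in set(rule_base) | set(increment_rules):
        s = weighted_support(rule_base.get(rule, 0.0),
                             increment_rules.get(rule, 0.0),
                             w_new)
        if s >= min_supp:
            merged[rule] = s
    return merged


# Hypothetical supports before and after the new movie is released.
old_rules = {("toy", "mulan"): 0.30, ("toy", "tarzan"): 0.02}
new_rules = {("toy", "mulan"): 0.01, ("toy", "tarzan"): 0.40}

print(maintain(old_rules, new_rules, w_new=0.6, min_supp=0.2))
```

With a weight of 0.6 on the increment, the Tarzan rule survives the threshold while the Mulan rule, popular only in the old data, drops out.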



3 Methodology and System Architecture 

3.1 System Structure 

Figure 2 illustrates the system structure that we designed for query expansion.

The system consists of four main parts: the interactive interface, the association rule
maintenance module, the query constructor, and the query processor.

The interactive construction interface shows the heuristic data and the results of the last
round. Users can view the heuristic data and take part in the query construction
interactively. The results of every retrieval round are also shown here. Users can
click on the relevant results or modify the queries. The interface gathers this
feedback information and passes it to the rule maintenance module. We discuss the
heuristic data later. The association rule maintenance module extracts the term
associations from the feedback data and combines them into the association rule set for
query construction. Users can adjust the weights of the new rules.

Fig. 1. Maintenance of association rules.

Here, DB: original database; DBi (i = 1, 2, ...): incremental data sets to DB;
RBli: rule base in DBi; RBj: resulting rule base after maintenance.



The query constructor uses the term association rules and user-input information to 
construct queries. 

The query processor calculates a weighted-cosine similarity between a query and a 
document. 
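
The similarity computation is not spelled out here, so the following is only a minimal sketch of a cosine similarity over weighted term vectors. The dictionary representation, the function name `weighted_cosine` and the example weights are assumptions for illustration.

```python
import math
from typing import Dict


def weighted_cosine(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_q * norm_d)


# Hypothetical weights: expansion terms could carry a lower weight than core terms.
query = {"nuclear": 1.0, "treaty": 1.0, "proliferation": 0.5}
doc = {"nuclear": 2.0, "treaty": 1.0, "inspection": 1.0}

print(weighted_cosine(query, doc))
```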



3.2 Heuristic Data 
. Core Query Words 

Initially, the set contains all the words of the original query. Users can change the
words in the set and see the resulting changes in the graph of term associations. This
helps users find related words and add them to the query.

Fig. 2. System structure for heuristic query construction.



. Most Frequent Words in Feedback Data 

In the interactive interface, users can click relevant documents as feedback for the next
retrieval round. The most frequent words in the feedback data are very useful for the new
query because those documents are exactly what the users want to find.

. Exploration of Graph of Term Associations 

A high-confidence rule of the form $t_1 \rightarrow t_2$ indicates that $t_2$ often appears
in a document if $t_1$ appears. This suggests a certain type of relation between the terms,
such as hypernym/hyponym or holonym/meronym, indicating a narrower/broader
meaning between the terms. These relations characterize what we could call a
contextual holonymy, that is, if $t_1 \rightarrow t_2$, then $t_1$ is part of the vocabulary
in the topical context suggested by the concept denoted by $t_2$. This helps us understand
the latent semantics behind the rules and their effects on the later query expansion.
Users can check all words related to the core query words.



3.3 Heuristic Construction Process 

In this section, we show how the query constructor works. The preprocessed
original query is a small set of words $Q = \{w_i \mid i = 1, \ldots, n\}$. This set is
called the core set. Users can operate on the set directly (adding or removing words) in
the interactive interface according to the heuristic information. Users can also click the
relevant documents of the last retrieval round to provide feedback. The maintenance module
mines new rules from the feedback documents and combines them with the old rules. Users can
explore the term relations in the rules and operate on the core set according to these
relations. The association rule set is $R = \{word1 \rightarrow word2, support, confidence\}$.



Association rule expansion algorithm
    Q ← original query
    Q1 = Q
    for each rule word1 → word2 in R
        if word1 in Q then Q1 = Q1 + {word2}
    output Q1
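
The expansion algorithm above can be sketched in Python as follows. The tuple layout of a rule and the optional `min_conf` filter are assumptions added for illustration; with the default `min_conf=0.0` the function behaves like the pseudocode and uses every rule whose antecedent is in the core set.

```python
from typing import Iterable, Set, Tuple

# (word1, word2, support, confidence), mirroring the rule set R
Rule = Tuple[str, str, float, float]


def expand_query(core_set: Set[str],
                 rules: Iterable[Rule],
                 min_conf: float = 0.0) -> Set[str]:
    """One-step expansion: add word2 for each rule word1 -> word2 with word1 in Q."""
    expanded = set(core_set)  # Q1 = Q
    for word1, word2, _support, conf in rules:
        # The test is against the original core set, so expansion is not chained.
        if word1 in core_set and conf >= min_conf:
            expanded.add(word2)
    return expanded


# Hypothetical rules and core query words.
rules = [
    ("nuclear", "weapon", 0.10, 0.90),
    ("treaty", "proliferation", 0.05, 0.70),
    ("weapon", "missile", 0.10, 0.80),
]
core = {"nuclear", "treaty"}

print(sorted(expand_query(core, rules)))
print(sorted(expand_query(core, rules, min_conf=0.8)))
```

Because the membership test uses the original set Q rather than the growing set Q1, "missile" is not added here: its antecedent "weapon" enters the query only through expansion.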



4 The Performance of Query Construction with Association Rules 

We used AP90 from TREC4 as our benchmark. The AP90 corpus contains more than 
78,000 documents. After stemming and the filtering of stop words, we found more 
than 133,000 different terms. The average number of different terms in a document is 
217. The largest documents contain approximately 900 terms and the smallest 5 
terms. 

There are 50 queries, called topics in the TREC terminology, associated with
AP90. Each topic consists of a natural language proposition or question. For each of
these topics, a list of relevant documents has been constructed manually and endorsed
by human experts. This list is provided together with the topic. On average, a topic for
AP90 calls for 32 documents. The largest topic calls for 145 documents. Two topics
call for no relevant document at all. Topic 202 is “Status of nuclear proliferation
treaties -- violations and monitoring”. It calls for 52 documents.



4.1 Rocchio Algorithm 



We compare our association rule expansion method with the Rocchio algorithm
[14], a classic algorithm for extracting information from relevance feedback. We
briefly describe the Rocchio algorithm here.

The basic idea of the Rocchio algorithm is to combine the document vectors from
relevance feedback with the original query vector. It uses both positive and negative
documents to expand the original query. Terms are assigned positive or negative
weights based on whether they occur in relevant or non-relevant feedback documents.
More precisely, the centroid of the relevant feedback documents is added to
the expanded query. The formula of the Rocchio algorithm is as follows [15]:



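$$Q_1 = Q_0 + \beta \, \frac{1}{n_1} \sum_{i=1}^{n_1} R_i \; - \; \gamma \, \frac{1}{n_2} \sum_{i=1}^{n_2} S_i$$

where $Q_0$ is the original query, $Q_1$ is the modified query, $R$ is the set of relevant
documents ($R_i$ is the vector of the $i$-th one, $n_1$ their number), and $S$ is the set
of non-relevant documents ($S_i$ is the vector of the $i$-th one, $n_2$ their number);
$\beta$ and $\gamma$ are weighting coefficients.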
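
For readers who prefer code, here is a minimal sketch of the commonly used form of the Rocchio update, in which the centroid of the relevant documents is added to the query and the centroid of the non-relevant documents is subtracted. The sparse-dictionary representation, the function name and the coefficient names `beta` and `gamma` with their default values are assumptions for illustration, not settings reported in this paper.

```python
from typing import Dict, List

Vector = Dict[str, float]


def rocchio(q0: Vector,
            relevant: List[Vector],
            non_relevant: List[Vector],
            beta: float = 0.75,
            gamma: float = 0.25) -> Vector:
    """Return q0 + beta * centroid(relevant) - gamma * centroid(non_relevant)."""
    q1 = dict(q0)
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # no feedback of this kind, nothing to add
        for doc in docs:
            for term, weight in doc.items():
                q1[term] = q1.get(term, 0.0) + coeff * weight / len(docs)
    return q1


# Hypothetical feedback for a one-term query.
q0 = {"nuclear": 1.0}
relevant = [{"nuclear": 1.0, "treaty": 1.0}, {"treaty": 1.0}]
non_relevant = [{"sports": 1.0}]

print(rocchio(q0, relevant, non_relevant, beta=0.5, gamma=0.5))
```

Terms that occur only in non-relevant documents end up with negative weights, which matches the description of positive and negative term weights above.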

4.2 Performance on Topic 202

In this experiment, we study the effectiveness of the feedback mechanisms on topic 202.
First, we perform the retrieval with the original query. We assume that all the relevant
documents among the top 100 documents are marked as relevant feedback. The Rocchio
algorithm is then applied to this feedback.

Fig. 3. Precision on topic 202

Fig. 4. Precision on all topics

Then we perform the query expansion with global association rules. We again obtain a
ranked list and relevance feedback from the list. A set of association rules is obtained
by mining the relevant documents. After combining them with the old global rules using
the maintenance module, we perform the query expansion again. The retrieval
results are shown in Figure 3.

As Figure 3 shows, both methods with relevance feedback improve the precision
significantly, especially the rule maintenance method when recall < 0.6.

4.3 Performance on All Topics 

Figure 4 shows the corresponding graph of the average results over all topics. The results
are not as impressive as for topic 202. The reason is that some queries have very low
precision over the first 100 ranked documents, which means that users cannot gather enough
relevant feedback for the query construction.



5 Conclusions and Future Works 

In this paper, we presented a new method to handle relevance feedback by association rule
maintenance. We presented several kinds of heuristic information to help users understand
and construct their queries interactively. In particular, users can explore the term
associations effectively by changing the words in the core query set. Our experimental
study showed that the approach could be used as an effective means to represent
semantics. We also proposed a new model to integrate the feedback information. The
weighted rule maintenance model combines different term relations obtained from the
feedback chains. Our experiments showed that it is effective for tackling the
inadequate query problem.

We plan to extend this work in the following ways. First, we have mainly been
concerned with rules of type 1, which show the words that can be inferred from a query
word. The other two kinds of rules should be useful too, though they are currently
difficult to handle. We are currently looking into techniques for them. Second, the
effectiveness of the proposed approach essentially depends on the number of relevant
documents in the top N ranked list, so improving the precision at low recall is the
bottleneck for further work. We plan to run the system on a classified text set, where
more original information is available to users; this should benefit the precision at low
recall.

Abstract. In the context of mining frequent itemsets, numerous strategies
have been proposed to push several types of constraints within the
most well known algorithms. In this paper, we integrate the recently 
proposed ExAnte data reduction technique within the FP-growth algo- 
rithm. Together, they result in a very efficient frequent itemset mining 
algorithm that effectively exploits monotone constraints. 



1 Introduction 

The problem of how to push different types of constraints into the frequent item- 
sets computation has been extensively studied [5,6,3]. However, while pushing 
anti-monotone constraints deep into the mining algorithm is easy and effective, 
the case is different for monotone constraints. Indeed, anti-monotone constraints 
can be used to effectively prune the search space to a small downward closed 
collection, while the upward closed collection of the search space satisfying the 
monotone constraints cannot be pruned at the same time. Recently, it has
been shown that a real synergy of these two opposite types of constraints exists 
and can be exploited by reasoning on both the itemset search space and the 
input database together, using the ExAnte data-reduction technique [2]. This 
way, pushing monotone constraints does not reduce anti-monotone pruning op- 
portunities, but on the contrary, such opportunities are boosted. Dually, pushing 
anti-monotone constraints boosts monotone pruning opportunities: the two com- 
ponents strengthen each other recursively. This idea has been generalized in an 
Apriori-like computation in ExAMiner [1]. 

In this paper we show how this synergy can be exploited even better within 
the well known FP-growth algorithm [4]. Thanks to the recursive projecting ap- 
proach of FP-growth, the ExAnte data-reduction is pervasive all over the compu- 
tation. All the FP-trees built recursively during the FP-growth computation can 
be pruned extensively by using the ExAnte property, obtaining a computation 
with a smaller number of smaller trees. We call such a tiny FP-tree, obtained by 
growing and pruning, an FP-bonsai. 

The resulting method overcomes on one hand the main drawback of FP- 
growth, which is its memory requirements, and on the other hand, the main 
drawback of ExAMiner which is the I/O cost of iteratively rewriting the reduced 
datasets to disk. 

2 Preliminaries

Let $\mathcal{I} = \{x_1, \ldots, x_n\}$ be a set of items. An itemset $X$ is a non-empty
subset of $\mathcal{I}$. A transaction is a couple $\langle tid, X \rangle$ where $tid$ is
the transaction identifier and $X$ is an itemset. A transaction database $\mathcal{D}$ is a
set of transactions. An itemset $X$ is contained in a transaction $\langle tid, Y \rangle$
if $X \subseteq Y$. The support of an itemset $X$ in database $\mathcal{D}$, denoted by
$supp_{\mathcal{D}}(X)$, is the number of transactions in $\mathcal{D}$ that contain $X$.
Given a user-defined minimum support $\sigma$, an itemset $X$ is called frequent in
$\mathcal{D}$ if $supp_{\mathcal{D}}(X) \geq \sigma$. The frequent itemset mining problem
requires computing the set of all frequent itemsets. A constraint on itemsets is
a function $C : 2^{\mathcal{I}} \rightarrow \{true, false\}$. We say that an itemset $I$
satisfies a constraint if and only if $C(I) = true$. Let
$Th(C) = \{X \mid C(X) = true\}$ denote the set of all itemsets $X$ that satisfy
constraint $C$. In general, given a conjunction of constraints $C$, the constrained
frequent itemsets mining problem requires computing $Th(C_{freq}) \cap Th(C)$, where
$C_{freq}$ is the frequency constraint.

In particular we focus on two kinds of constraint: a constraint $C_{AM}$ is
anti-monotone if $C_{AM}(X) \Rightarrow C_{AM}(Y)$ for all $Y \subseteq X$; a constraint
$C_M$ is monotone if $C_M(X) \Rightarrow C_M(Y)$ for all $X \subseteq Y$. Since any
conjunction of anti-monotone constraints is an anti-monotone constraint, and any
conjunction of monotone constraints is a monotone constraint, we consider without loss of
generality the conjunction $Th(C_{freq}) \cap Th(C_M)$ where $C_M$ is a simple monotone
constraint such as $sum(X.prices) \geq n$.
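
The two definitions can be checked by brute force on a small example. In the sketch below, the items, prices, transactions and thresholds are hypothetical, and the helper names are assumptions; the code enumerates all non-empty itemsets and tests the implications stated above for a sum-of-prices constraint and for a frequency constraint.

```python
from itertools import combinations


def subsets(items):
    """All non-empty itemsets over the given items, as frozensets."""
    items = sorted(items)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone(c, items):
    """C(X) implies C(Y) for every X subset of Y."""
    all_sets = list(subsets(items))
    return all(c(y) for x in all_sets if c(x) for y in all_sets if x <= y)


def is_anti_monotone(c, items):
    """C(Y) implies C(X) for every X subset of Y."""
    all_sets = list(subsets(items))
    return all(c(x) for y in all_sets if c(y) for x in all_sets if x <= y)


# Hypothetical data for illustration only.
prices = {"a": 5, "b": 8, "c": 14}
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]


def c_sum(itemset, n=13):
    """sum(X.prices) >= n (monotone for non-negative prices)."""
    return sum(prices[i] for i in itemset) >= n


def c_freq(itemset, sigma=2):
    """supp(X) >= sigma (anti-monotone)."""
    return sum(1 for t in transactions if itemset <= t) >= sigma


print(is_monotone(c_sum, prices), is_anti_monotone(c_sum, prices))
print(is_monotone(c_freq, prices), is_anti_monotone(c_freq, prices))
```

On this data the price constraint is monotone but not anti-monotone, and the frequency constraint is anti-monotone but not monotone.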

The recently introduced ExAnte method [2] exploits monotone constraints in
order to reduce the input database and thus to prune the search space. This
method is based on the synergy of the following two data-reduction operations:
(1) $\mu$-reduction, which deletes transactions in $\mathcal{D}$ which do not satisfy
$C_M$; and (2) $\alpha$-reduction, which deletes from all transactions in $\mathcal{D}$
singleton items which do not satisfy $C_{freq}$. The ExAnte property states that a
transaction which does not satisfy the given monotone constraint can be deleted from the
input database ($\mu$-reduction) since it will never contribute to the support of any
itemset satisfying the constraint. A major consequence of reducing the input database in
this way is that it implicitly reduces the support of a large number of itemsets. As a
consequence, some singleton items can become infrequent and can not only be
removed from the computation, but can also be deleted from all transactions
in the input database ($\alpha$-reduction). This removal has another positive
effect: a reduced transaction might now violate the monotone constraint and can in turn
be deleted. We are thus inside a loop where two different kinds of pruning ($\alpha$ and
$\mu$) cooperate to reduce the search space and the input dataset, strengthening
each other step by step until no more pruning is possible (a fix-point has been
reached). This is the key idea of the ExAnte preprocessing method [2]. In the
end, the reduced dataset resulting from this fix-point computation is usually
much smaller than the initial dataset.

Given a transaction database $\mathcal{D}$, a conjunction of monotone constraints $C_M$,
and a conjunction of anti-monotone constraints $C_{AM}$, we define the reduced
dataset obtained by the fix-point application of $\mu$ and $\alpha$ pruning as: Pcj,^^ Cm ’

3 FP-Bonsai

The FP-growth algorithm [4] stores the actual transactions from the database in 
a trie structure (prefix tree), and additionally stores a header table containing 
all items with their support and the start of a linked list going through all 
transactions that contain that item. This data structure is denoted by FP-tree 
(Frequent-Pattern tree) [4]. For example, consider the transaction database in 
Figure 2(a) and a minimal support threshold of 4. First, all infrequent items are 
removed from the database, all transactions are reordered in support descending 
order and inserted into the FP-tree, resulting in the tree in Figure 2(b). 

Given a transaction database $\mathcal{D}$ and a minimal support threshold $\sigma$, we
denote the set of all frequent itemsets with the same prefix $I \subseteq \mathcal{I}$ by
$\mathcal{F}[I](\mathcal{D}, \sigma)$. FP-growth recursively generates for every singleton
item $\{i\} \in Th(C_{freq})$ the set $\mathcal{F}[\{i\}](\mathcal{D}, \sigma)$ by creating
the so-called $i$-projected database of $\mathcal{D}$. This database, denoted
$\mathcal{D}^i$, is made of all transactions in $\mathcal{D}$ containing $i$, from which
$i$ and all items which come before $i$, w.r.t. the support descending order, are deleted.
This $i$-projected database, which is again stored as an FP-tree, is recursively mined
by FP-growth. The FP-growth algorithm is shown in Figure 1.



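Algorithm FP-growth
Input: $\mathcal{D}$, $\sigma$, $I \subseteq \mathcal{I}$
Output: $\mathcal{F}[I](\mathcal{D}, \sigma)$

$\mathcal{F}[I] := \{\}$
for all $i \in \mathcal{I}$ occurring in $\mathcal{D}$ do
    $\mathcal{F}[I] := \mathcal{F}[I] \cup \{I \cup \{i\}\}$
    $H := \{\}$; $\mathcal{D}^i := \{\}$
    for all $j \in \mathcal{I}$ occurring in $\mathcal{D}$ such that $j > i$ do
        if $supp_{\mathcal{D}}(I \cup \{i, j\}) \geq \sigma$ then
            $H := H \cup \{j\}$
    for all $\langle tid, X \rangle \in \mathcal{D}$ with $i \in X$ do
        $\mathcal{D}^i := \mathcal{D}^i \cup \{\langle tid, X \cap H \rangle\}$
    Compute $\mathcal{F}[I \cup \{i\}](\mathcal{D}^i, \sigma)$
    $\mathcal{F}[I] := \mathcal{F}[I] \cup \mathcal{F}[I \cup \{i\}]$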



Algorithm FP-pruning
Input: $\mathcal{D}$, $C_{AM}$, $C_M$, $I$
Output: the reduced database

repeat
    // $\mu$-pruning of $\mathcal{D}$
    for all transactions $t$ occurring in $\mathcal{D}$ do
        if $C_M(I \cup t) = false$ then
            Remove $t$ from $\mathcal{D}$
    // $\alpha$-pruning of $\mathcal{D}$
    for all items $i$ occurring in $\mathcal{D}$ do
        if $C_{AM}(I \cup \{i\}) = false$ then
            Remove $i$ from $\mathcal{D}$
until nothing changed



Fig. 1. The FP-growth and FP-pruning algorithms. 



The main trick exploited in FP-growth is that it only needs to find all singleton 
frequent itemsets in the given database. Then, for every such item, it creates 
the corresponding projected database in which again, only the (local) singleton 
frequent itemsets have to be found. This process goes on until no more (local) 
items exist. The FP-tree structure guarantees that all this can be done efficiently. 
In this way, FP-growth implicitly creates a lot of databases, represented by FP- 
trees. The good news is that all these datasets (trees) can be reduced (pruned) 
using the ExAnte technique. We call such a pruned FP-tree an FP-bonsai. 

Fig. 2. A transaction database (a), the corresponding FP-tree for minimal support
threshold of 4 (b), the initial FP-bonsai for Example 1 (c), the FP-bonsai after the
removal of item g (d), and (e) the final FP-bonsai.




Fig. 3. Number of FP-bonsai built with fixed minimum support and moving monotone
threshold (a); run time of ExAMiner and FP-bonsai on dataset BMS-POS with
minimum support = 200 (b); and minimum sum of prices = 3000 (c).



The FP-pruning procedure is shown in Figure 1. In order to obtain the complete
algorithm that finds all itemsets satisfying the given constraints, the FP-pruning
algorithm should be called before the first line of the FP-growth algorithm. The fact
that the database is stored as an FP-tree is not specifically mentioned. That is because
this is actually not necessary, but the FP-tree is simply the most effective data
structure for these algorithms to use. How the pruning mechanisms can be effectively
applied on the FP-tree structure is described by the following example.

Example 1. Consider again the transactional database in Figure 2(a) and the
price-table: {a : 5, b : 8, c : 14, d : 30, e : 20, f : 15, g : 6, h : 12}. Suppose that we
want to compute itemsets with support no less than 4 and sum of prices no less
than 45. The FP-bonsai construction starts with a first scan of $\mathcal{D}$ to count the
support of all singleton items. Note that transaction 4 is not used since it does
not satisfy the monotone constraint. This causes items a and e to be infrequent,
so they are not included in the header table. Frequent items are ordered in support
descending order and the tree is built as seen in Figure 2(c). At this point we
find that item g is no longer frequent, so we remove all its occurrences
in the tree using the link-node structure. The resulting pruned tree is in Figure
2(d). This $\alpha$-pruning has created a new opportunity for $\mu$-pruning: the
path on the right edge of the tree no longer satisfies the monotone constraint
and hence can be removed from the tree. In Figure 2(e) we have the final
FP-bonsai (the fix-point has been reached) for the given problem. Note that the
final size of the FP-bonsai is 3 nodes (which represents the unique solution to
the given problem: itemset bcd with support = 4 and sum of prices = 52), while
the size of the usual FP-tree for the same problem (Figure 2(b)) is 18 nodes!

Once the FP-bonsai has been built (i.e. once the fix-point of $\alpha$ and $\mu$ pruning
has been reached) we can efficiently mine all frequent itemsets satisfying the
given constraints using FP-growth. Thanks to the recursive structure of the
FP-growth based algorithm, the ExAnte property is deeply amalgamated with the
frequent itemset computation: not only is the initial tree a pruned tree (an
FP-bonsai), but all the other projected trees built during the recursive growing
phase will also be much smaller in number and in size.
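
The interplay of pruning and projection can be sketched without the FP-tree itself, which the text notes is not strictly necessary. The Python sketch below stores each (projected) database as a plain list of item sets, specializes the anti-monotone constraint to the frequency constraint, and filters the output by the monotone constraint; these choices, the function names and the toy transactions are assumptions made for illustration (Figure 2(a) is not reproduced here). Only the price table and the thresholds are taken from Example 1.

```python
def fp_pruning(db, prefix, sigma, c_m):
    """Fix-point of mu-pruning and alpha-pruning on a (projected) database."""
    db = [set(t) for t in db]
    changed = True
    while changed:
        changed = False
        # mu-pruning: drop transactions t for which C_M(prefix U t) is false
        kept = [t for t in db if c_m(prefix | t)]
        if len(kept) != len(db):
            changed = True
        db = kept
        # alpha-pruning: drop items that are no longer frequent
        counts = {}
        for t in db:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        infrequent = {i for i, c in counts.items() if c < sigma}
        if infrequent:
            changed = True
            db = [t - infrequent for t in db]
    return db


def mine(db, sigma, c_m, prefix=frozenset()):
    """Pattern growth with pruning before each step; returns {itemset: support}."""
    db = fp_pruning(db, prefix, sigma, c_m)
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # after pruning every remaining item is frequent; use support-descending order
    order = sorted(counts, key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(order)}
    result = {}
    for i in order:
        new_prefix = prefix | {i}
        if c_m(new_prefix):  # report only itemsets satisfying the monotone constraint
            result[new_prefix] = counts[i]
        # i-projected database: transactions containing i, keeping items after i
        projected = [{j for j in t if rank[j] > rank[i]} for t in db if i in t]
        result.update(mine(projected, sigma, c_m, new_prefix))
    return result


prices = {"a": 5, "b": 8, "c": 14, "d": 30, "e": 20, "f": 15, "g": 6, "h": 12}


def min_price_45(itemset):
    """Monotone constraint: sum of prices no less than 45."""
    return sum(prices[i] for i in itemset) >= 45


# Hypothetical transactions, not those of Figure 2(a).
db = [{"b", "c", "d"}, {"b", "c", "d", "g"}, {"a", "b", "c", "d"},
      {"b", "c", "d", "e"}, {"a", "e", "g"}, {"e", "f", "g"},
      {"a", "f", "g", "h"}]

print(fp_pruning(db, frozenset(), 4, min_price_45))
print(mine(db, 4, min_price_45))
```

In this toy database item g occurs in four transactions and is therefore frequent at first, but three of them violate the price constraint; once they are removed g becomes infrequent and is deleted, which mirrors the cooperation of the two prunings described in Example 1.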

The reduction of number of trees built w.r.t. FP-growth (which is the compu- 
tation with minimum sum of prices = 0) is dramatic, as illustrated in Figure 3(a). 

Our experimental study confirms that on many occasions FP-bonsai outperforms
ExAMiner, which is the state-of-the-art algorithm for the computational problem
addressed.

From Figures 3(b) and 3(c) we can see that ExAMiner is faster only when the selectivity
of one of the two constraints is very strong, and hence the set of solutions
very small. In particular, ExAMiner is faster in recognizing when the user-defined 
constraints are so selective that the problem has an empty set of solutions. But 
in all the other cases FP-bonsai performs much better. In particular, when one 
of the two constraints is not-so-selective, FP-bonsai exhibits a much more stable 
behaviour, while ExAMiner’s computation time increases quickly. Consider, for 
instance Figure 3(c): at an absolute minimum support of 150, FP-bonsai takes 36 
seconds against the 4841 seconds (1 hour and 20 minutes) taken by ExAMiner. 

Abstract. GRD is an algorithm for k-most interesting rule discovery.
In contrast to association rule discovery, GRD does not require the
use of a minimum support constraint. Rather, the user must specify
a measure of interestingness and the number of rules sought (k). This
paper reports efficient techniques to extend GRD to support mining 
of negative rules. We demonstrate that the new approach provides 
tractable discovery of both negative and positive rules. 

Keywords: Rule Discovery, Negative Rules. 



1 Introduction 

Rule discovery involves searching through a space of rules to determine rules of 
interest to a user. Association rule discovery [1] seeks rules between frequent items
(literals which satisfy a minimum support constraint). A rule set is developed 
which can be pruned by using further user defined constraints. 

Generalized rule discovery is an alternative rule discovery approach. Rules in 
GRD are developed based on user defined constraints. There is no need to apply 
a minimum support constraint. Rather, the user must specify a number of rules 
to be generated, k. This avoids the inherent limitations of the minimum support 
methodology. 

Mining negative rules from databases has been approached using association 
rule discovery [3,6,12]. We seek to extend GRD to mining negative rules so as 
to enable negative rules to be discovered without the need to specify minimum 
support constraints. 



2 Association Rule Discovery 

Association rule discovery aims to find rules describing associations between 
items [1]. A rule has the form $A \Rightarrow B$, where A is the antecedent and B is the
consequent. Both A and B are itemsets from the database. The rule implies that 
if an itemset A occurs in a record then itemset B is likely to occur in the same 
record of the database. 

Constraints are defined to limit the space of rules to be searched [8]. For
example, 1000 items define $2^{1000}$ possible combinations of itemsets, which results
in a large number of rules to explore. The minimum support constraint is used
to limit the number of itemsets that need be considered.

The support of an itemset is the frequency with which the itemset occurs 
in the database. The itemsets which satisfy the minimum support constraint 
are frequent itemsets. From these itemsets the rules are developed. Further con- 
straints can be applied to prune the set of rules discovered [7]. 

3 Negative Rules 

Mining negative rules has been given some attention and has proved to be use- 
ful. Initial approaches [3] considered mining negative associations between two 
itemsets. Savasere, Omiecinski and Navathe [6] use the method of generating 
positive rules from which negative rules are mined. The result is that there are 
fewer but more interesting negative rules that are mined. 

Negative association rules are association rules in which either the antecedent
or consequent or both are negated. For example, for the rule $A \Rightarrow B$
the negative rules are $A \Rightarrow \neg B$ (A implies not B), $\neg A \Rightarrow B$,
and $\neg A \Rightarrow \neg B$ [12].

The rules above specify concrete relationships between each itemset compared
to [6] who look at the rule $A \nRightarrow B$. Another possibility is to consider
itemsets within the antecedent or the consequent being negated (e.g.
$(\neg A \,\&\, B) \Rightarrow C$).

4 Generalized Rule Discovery 

In some applications minimum support may not be a relevant criterion to select 
rules. For example, often high-value rules relate to infrequent associations, a 
problem known as the vodka and caviar problem [4] . 

The GRD algorithm [10,11] implements k-most interesting rule discovery.
This approach avoids the need to specify a minimum support constraint, re- 
placing it by a constraint on the number of rules to be found together with the 
specification of an interestingness measure. Further constraints may be specified, 
including a minimum support constraint if desired, but these are not required. 

GRD performs an OPUS search [9] through the space of potential antecedents,
and for each antecedent the set of consequent conditions is explored.
The consequent is limited to a single condition to simplify the search.

Space constraints preclude us from presenting the algorithm here. The extensions
to the base GRD algorithm [11] are, however, straightforward. Based on
the idea of a diffset [13], many of the statistics for negative rules can be derived
from statistics already computed for positive rules. Specifically,

- $support(A \,\&\, \neg x \Rightarrow B) = support(A \Rightarrow B) - support(A \,\&\, x \Rightarrow B)$.

- $support(A \Rightarrow \neg B) = support(A) - support(A \Rightarrow B)$.

However, the search space is nonetheless considerably larger, as an increase 
in the number of conditions considered results in an exponential increase in the 
size of the search space that must be explored. 
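
The two support identities above can be verified on a small example by comparing the derived counts with direct counts. In the sketch below the records and helper names are hypothetical; the leverage function uses the usual definition, support of the rule minus the product of the supports of antecedent and consequent, which is stated here as an assumption because the formula is not spelled out in this paper.

```python
def count(records, present=(), absent=()):
    """Number of records containing every item in `present` and none in `absent`."""
    return sum(1 for r in records
               if all(p in r for p in present)
               and not any(a in r for a in absent))


def leverage(n_rule, n_antecedent, n_consequent, n_records):
    """supp(rule) - supp(antecedent) * supp(consequent), from absolute counts."""
    return (n_rule / n_records
            - (n_antecedent / n_records) * (n_consequent / n_records))


# Hypothetical records over the conditions a, b and x.
records = [{"a", "b"}, {"a"}, {"a"}, {"a", "x"}, {"b"}, {"b", "x"}]
n = len(records)

n_a = count(records, ["a"])
n_b = count(records, ["b"])
n_a_b = count(records, ["a", "b"])            # support(A => B)
n_ax_b = count(records, ["a", "x", "b"])      # support(A & x => B)

# Derived without another pass over the data:
n_a_not_b = n_a - n_a_b                       # support(A => not B)
n_a_notx_b = n_a_b - n_ax_b                   # support(A & not x => B)

print(n_a_not_b, count(records, ["a"], ["b"]))
print(n_a_notx_b, count(records, ["a", "b"], ["x"]))
print(leverage(n_a_b, n_a, n_b, n))
print(leverage(n_a_not_b, n_a, n - n_b, n))
```

In this example the positive rule has negative leverage while the corresponding negative rule has positive leverage, which illustrates why negative rules can displace positive ones among the top-ranked rules.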

5 Experiments

The modified GRD program is referred to as GRDI (GRD new Implementation).
Experiments were carried out on ten datasets with GRDI. Most of the datasets 
used were the same datasets used for the comparison of the GRD system with 
Apriori in [11]. 



Table 1. Execution times (in seconds) of GRD and GRDI

Data Files            Records    GRD    GRDI    Ratio
connect4               67,557     20     106     5.30
covtype               581,012    835    1976     2.37
ipums.la.99            88,443      7    1634   233.43
letter-recognition     20,000      1      34    34.00
mush                    8,124      1       8     8.00
pendigits              10,992      1      28    28.00
shuttle                58,000      1      11    11.00
soybean-large             307      1       4     4.00
splice junction         3,177      6    1872   312.00
ticdata2000             5,822      7     647    92.43


Nine out of the ten datasets are taken from the UCI Machine Learning and
KDD repositories [2,5]. The other dataset, ticdata2000, is a market-basket dataset
used in research in association rule discovery [14]. Three sub-ranges were created
for each numeric attribute, each containing approximately one third of the
records. The experiments were carried out on a Linux server with a processor
speed of 1.20 GHz and 256 MB of main memory.
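
The splitting of numeric attributes into three sub-ranges of roughly equal size can be sketched as equal-frequency binning. How the cut points were actually chosen is not described, so the procedure, function names and sample values below are assumptions for illustration only.

```python
import bisect
from typing import List, Sequence


def equal_frequency_cuts(values: Sequence[float], n_bins: int = 3) -> List[float]:
    """Cut points that split the sorted values into n_bins groups of similar size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def assign_bin(value: float, cuts: Sequence[float]) -> int:
    """Index of the sub-range containing value (0 is the lowest sub-range)."""
    return bisect.bisect_right(cuts, value)


# Hypothetical attribute values.
values = [7, 1, 4, 9, 2, 6, 3, 8, 5]
cuts = equal_frequency_cuts(values)
bins = [assign_bin(v, cuts) for v in values]

print(cuts)
print(bins)
```

Ties in the data can make the groups uneven, which is consistent with the sub-ranges holding only approximately one third of the records each.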

In all experiments GRD and GRDI search for the 1000 rules with the highest
value of the search measure, leverage. The maximum number of conditions
allowed on the left-hand side was 4, and both systems allow only a single
condition on the right-hand side, which simplifies the search
task. The execution times for GRD and GRDI are presented in Table 1. Two
observations from the results are:

1. GRD: execution time is not determined by the number of records alone; it is
short for some large datasets and long for others. For example, connect4 has
67,557 records and requires 20 seconds to develop rules, whereas ipums.la.99
has 88,443 records and takes only 7 seconds.

2. GRDI: for most datasets GRDI’s execution time is slightly greater than
GRD’s, e.g. mush. However, some datasets require much greater execution times
for GRDI than for GRD, e.g. ticdata2000.

The large increase in computational time for some datasets (e.g. ipums.la.99)
is primarily due to the increase in the size of the search space. If few
negative rules are generated then the execution time is only a little greater. If a
majority of rules are negative (sometimes all) then the execution times are a lot
greater. The increase in execution time is directly proportionate to the increase
in size of the search space.

Table 2. Comparison of Minimum and Maximum Leverage values

                      GRD                              GRDI
Data Files            min. lev.  max. lev.  mean       min. lev.  max. lev.  mean
connect4              0.1224     0.1227     0.1225     0.1688     0.1707     0.1698
covtype               0.1083     0.1743     0.1413     0.2459     0.2474     0.2467
ipums.la.99           0.2080     0.2484     0.2282     0.2499     0.2500     0.2500
letter-recognition    0.0455     0.1459     0.0957     0.1020     0.1499     0.1395
mush                  0.1558     0.2109     0.1833     0.1994     0.4930     0.3390
pendigits             0.0615     0.1757     0.1186     0.1050     0.1832     0.1441
shuttle               0.0409     0.1599     0.1004     0.0911     0.2040     0.1766
soybean-large         0.2137     0.2359     0.2248     0.2286     0.6182     0.4324
splice junction       0.0404     0.1523     0.0963     0.1244     0.1733     0.1489
ticdata2000           0.1899     0.1922     0.1910     0.2184     0.5341     0.3763

The comparison of minimum leverage values of the rules generated by both
systems shows that GRDI always contains negative rules in its solution. For the
datasets in which GRDI’s execution times were greater than GRD’s, rules with
much higher leverage were also generated. This is also true of the maximum
leverage values. An important observation is that for several datasets the
minimum leverage value of GRDI is greater than the maximum leverage value of
GRD. This implies that all the rules generated for those datasets are
negative rules.

6 Conclusions 

GRD has been extended to discover negative rules, providing negative rule dis- 
covery without the need to specify a minimum support constraint. This is useful 
because such a constraint is not appropriate for some domains and can prevent 
potentially interesting rules from being discovered. An additional advantage of 
GRD is that users can generate a specific number of rules that maximize a 
particular search measure. Incorporating the diffsets technique results in low 
additional computational time. 

The GRD algorithm is modified to iterate through two antecedent sets, one
for positive items and the other for negative items. Within each iteration a
second consequent set of negative items is explored in addition to the positive
consequent set.

A comparison of GRD and GRDI shows that for several datasets GRDI
took substantially longer to execute than GRD. The execution time increases
because these particular datasets contained many more negative rules than
positive rules. With the increased size of the search space,
the execution times are bound to be longer for GRDI. For some datasets only
negative rules were generated. In conclusion, developing negative and positive
rules using GRD is tractable.

Abstract. In this paper, we propose a new method of applying
association rules for recommendation systems. Association rule algo- 
rithms are used to discover associations among items in transaction 
datasets. However, applying association rule algorithms directly to make 
recommendations usually generates too many rules; thus, it is difficult 
to find interesting recommendations for users among so many rules. 
Rule templates define certain types of rules; therefore, they are one 
of the interestingness measures that reduce the number of rules that 
do not interest users. We describe a new method. By defining more 
appropriate rule templates, we are able to extract interesting rules for 
users in a recommendation system. Experimental results show that our 
method increases the accuracy of recommendations. 

Keywords: Association Rule, Rule Templates, Recommendation Sys- 
tems 



1 Introduction 

Generally speaking, there are two types of recommendation systems, content- 
based recommendation systems and collaborative recommendation systems [1]. 
Content-based recommendation systems, such as NewsWeeder [2] and InfoFinder 
[3], require representative properties from the data, which are hard to extract. 
Collaborative recommendation systems, such as GroupLens [4] and Ringo [5], 
observe the behaviors, patterns of the current users, and make recommendations 
based on the similarities between the current users and the new users. 

Few research efforts have applied association rule algorithms to
collaborative recommendation systems. The association rule algorithm was
first introduced in [6]. An association rule [6] is a rule of the form
$\alpha \rightarrow \beta$, where both the antecedent $\alpha$ and the
consequent $\beta$ represent itemsets. The quality of the rule is
traditionally measured by support and confidence [6]. The disadvantage of
using association rule algorithms for collaborative recommendation systems is
that there are usually too many rules generated, and it is difficult
to make recommendations to the users effectively and efficiently. To solve this
problem, Lin et al. [7] designed a new algorithm to adjust the value of support 
and to limit the number of rules generated. Klemettinen et al. [8] proposed rule 
templates to find interesting rules from a large set of association rules. Rule 
templates describe patterns for those items that appear both in the antecedent 
and in the consequent of the association rules. Defining rule templates more 
appropriately will result in extracting only interesting rules that match the
templates. In this paper, we explore the use of rule templates with the Apriori
association rule algorithm in collaborative recommendation systems to generate
interesting rules. The EachMovie dataset [9] is used in our experiments, in which we
consider the merit of movie genres for the generation of rule templates. Our ex- 
perimental results indicate that, with rule templates, association rule algorithms 
can efficiently extract interesting rules for collaborative recommendations. 

The rest of the paper is organized as follows. Section 2 discusses our methods 
of making recommendations. The experiments and evaluations are discussed in 
section 3. The final section makes concluding remarks and describes our future 
work. 



2 Our Method 

We would like to apply the association rule algorithm for recommendation sys- 
tems. In addition to using support and confidence, we examine the role of rule 
templates to predict the items in which users are most likely interested. 

Since we are interested in predicting movies and making recommendations 
efficiently, we define rule templates to reduce the number of rules. 

Template 1, $\{Movie_1, \ldots, Movie_m\} \rightarrow \{Movie_n\}$, specifies
that there is only one consequent in the generated rules.

Template 2, $(Genre_1 \cap \ldots \cap Genre_m \cap Genre_n) \neq \emptyset$,
specifies that only rules whose antecedents and consequents all belong to the
same movie genre will be generated. In our method, no score or vote information
related to the item in the recommendation system is considered.
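
The two templates can be read as simple filters over candidate rules. The
following is a minimal sketch in Python, not the authors' C++ implementation;
the rule representation (a pair of frozensets), the `genres` lookup table and
all function names are assumptions introduced only for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical rule representation: (antecedent movies, consequent movies).
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def matches_template_1(rule: Rule) -> bool:
    """Template 1: the consequent contains exactly one movie."""
    _, consequent = rule
    return len(consequent) == 1


def matches_template_2(rule: Rule, genres: Dict[str, Set[str]]) -> bool:
    """Template 2: the genre sets of all movies in the rule intersect."""
    antecedent, consequent = rule
    movies = antecedent | consequent
    if not movies:
        return False
    common = set.intersection(*(set(genres[m]) for m in movies))
    return len(common) > 0


def filter_rules(rules: List[Rule], genres: Dict[str, Set[str]]) -> List[Rule]:
    """Keep only the rules that satisfy both templates."""
    return [r for r in rules
            if matches_template_1(r) and matches_template_2(r, genres)]
```

Because a movie may belong to several genres, the sketch treats "same genre"
as "at least one genre shared by every movie in the rule", which is the
non-empty intersection stated in Template 2.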

3 Experiment 

3.1 EachMovie Dataset 

For this work, we use the EachMovie dataset, the most commonly used test bed
for the collaborative recommendation task, provided by Compaq's Systems Research
Center [9]. This data set is a collection of users' votes on 1,628 different movies
from 72,916 users over an 18-month period. Each movie is assigned to at least
one of 10 possible movie genres. Each movie is rated on a five-star
evaluation scheme. After removing movies that have no votes and users that
have never voted, we are left with 61,265 valid users, 1,623 valid movies and
2,811,983 votes.






J. Li, B. Tang, and N. Cercone 



For our experiments, only a subset of the data is considered. By counting
the number of movies that each user voted for, we find that
about 40% of users voted for no less than 2% (roughly 32) of all the movies. In the
following experiments, we consider only users who have voted for no less than
32 movies.

According to the kind of rules we wish to generate, we organize the
transaction dataset such that each transaction represents all the movies voted
for by a user.
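
The preprocessing described above can be sketched as follows. This is only an
illustration under stated assumptions: the input is taken to be an iterable of
hypothetical `(user_id, movie_id)` pairs, and the function name and the
`min_movies` parameter are not from the original work.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def build_transactions(
    votes: Iterable[Tuple[Hashable, Hashable]],
    min_movies: int = 32,
) -> Dict[Hashable, List[Hashable]]:
    """Group votes into one transaction per user and drop sparse users.

    Each transaction holds all movies a user voted for; users with fewer
    than `min_movies` distinct movies are discarded.
    """
    by_user = defaultdict(set)
    for user_id, movie_id in votes:
        by_user[user_id].add(movie_id)
    return {
        user: sorted(movies)
        for user, movies in by_user.items()
        if len(movies) >= min_movies
    }
```

The vote value itself is deliberately ignored, in line with the statement
that no score information is used.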

3.2 Performance Evaluation 

We use accuracy, defined as $a = c/t$, to evaluate the performance of our
method. Here $c$ stands for the number of times the antecedent and the
consequent of a rule belong to the same transaction, and $t$ stands for the
number of times the antecedent belongs to a transaction.
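
As a worked illustration of this measure, the sketch below counts $c$ and $t$
over a list of test transactions. The function name is hypothetical, and
returning 0.0 when the antecedent never occurs ($t = 0$) is an assumption,
since the text does not define that case.

```python
from typing import Iterable, Set


def rule_accuracy(antecedent: Set[str], consequent: Set[str],
                  transactions: Iterable[Set[str]]) -> float:
    """Return a = c / t for one rule over a collection of transactions."""
    antecedent, consequent = set(antecedent), set(consequent)
    t = 0  # transactions containing the antecedent
    c = 0  # transactions containing antecedent and consequent together
    for transaction in transactions:
        items = set(transaction)
        if antecedent <= items:
            t += 1
            if consequent <= items:
                c += 1
    # Assumption: a rule whose antecedent never fires gets accuracy 0.0.
    return c / t if t else 0.0
```

For example, with transactions {A, B}, {A}, {A, B, C} and {B}, the rule
A -> B has $t = 3$ and $c = 2$, so $a = 2/3$.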

3.3 Experiment Results 

In our experiment, we applied Borgelt's apriori algorithm [10] to generate
frequent itemsets.^ Our rule templates and rule generation procedure were
implemented in C++, and the target compiler and platform are g++ and Unix
respectively. We performed 4-fold cross validation for the following experiments.
All the experiments were performed on a Sun Fire V880 with four 900 MHz
UltraSPARC III processors and 8 GB of main memory. We applied the two templates
to two subsets of data which are commonly used [7], [11].

First Trial. The first subset we tried is from [11]. Training data represent the
first 1,000 users who have rated more than 100 movies. Testing data come from
the first 100 users whose user ID is larger than 70,000, and who also rated more
than 100 movies.

Table 1 a) shows the performance of this experiment when confidence is 80%. 
The first column shows the support value, the second column shows the accuracy 
from applying the first template to our algorithm, and the third column shows 
the accuracy of adding genre information (applying Template 2). Table 1 b) 
shows the performance when confidence is 90%. From Table 1, we can see that 
when applying movie genre information, extracting only the rules that all the 
movies belong to the same genre, we obtain a higher accuracy. 

In order to show that the computing overhead is also reduced by applying
Template 2, we counted the number of itemsets and rules generated, as shown in
Table 2.

Table 2 shows that using Template 2 reduces the number of rules generated.
As the support value gets lower, more frequent itemsets are generated, and thus
more rules are generated. The accuracy increases as the support value decreases.

^ Downloaded from http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html#assoc

Table 1. a) Accuracy when confidence = 80%, b) Accuracy when confidence = 90%

Support  Accuracy_Rules  Accuracy_Genre      Support  Accuracy_Rules  Accuracy_Genre
70%      73.30%          75.28%              70%      79.97%          80.25%
60%      75.83%          77.42%              60%      81.20%          82.93%
50%      78.33%          79.52%              50%      84.62%          85.87%
40%      80.78%          82.43%              40%      87.25%          88.35%

                 (a)                                          (b)



Table 2. a) Itemset Size and Rule Size when confidence = 80%, b) Itemset Size and
Rule Size when confidence = 90%

Min      Frequent  Assoc.     With Genre      Min      Frequent  Assoc.     With Genre
support  Itemsets  Rules      Rules           support  Itemsets  Rules      Rules
70%      272       608        198             70%      272       392        127
60%      2,773     8,303      1,439           60%      2,773     5,074      805
50%      35,276    139,796    10,385          50%      35,276    79,353     5,404
40%      690,382   3,525,426  88,298          40%      690,382   1,994,580  49,278

                 (a)                                          (b)



Table 3. a) Accuracy when confidence = 80%, b) Accuracy when confidence = 90%

Support  Accuracy_Rules  Accuracy_Genre      Support  Accuracy_Rules  Accuracy_Genre
20%      78.30%          87.04%              20%      75%             100%
10%      75.79%          91.83%              10%      81.44%          98.67%
5%       78.65%          93.86%              5%       82.87%          97.64%
4%       81.27%          94.72%              4%       83.37%          97.64%

                 (a)                                          (b)



Table 4. a) Itemset Size and Rule Size when confidence = 80%, b) Itemset Size and
Rule Size when confidence = 90%

Min      Frequent   Assoc.     With Genre     Min      Frequent   Assoc.     With Genre
support  Itemsets   Rules      Rules          support  Itemsets   Rules      Rules
20%      171        6          4              20%      171        86         30
10%      9,023      3,788      297            10%      9,023      15,671     1,360
5%       579,291    745,971    12,954         5%       579,291    1,926,017  37,855
4%       2,326,891  3,900,287  41,626         4%       2,326,891  9,154,962  104,589

                 (a)                                          (b)



By increasing the confidence, the accuracy will also be increased; thus, better 
quality rules will be extracted. 



Second Trial. The second subset we tried is from [7] . We used training data for 
the first 2,000 users. Testing data comes from users whose like ratios are less 
than 0.75, from which we randomly selected 20 users as one test set. We repeated 
this choice of test set 4 times, from which we obtained the average accuracy. The 
accuracy is shown by Table 3, which shows an average of more than 15% increase 
in accuracy using Template 2.

We list the number of itemsets and rules generated using different support
values for different confidence levels in Table 4. As we can see, when confidence
gets higher, fewer rules are generated; when support gets higher, fewer
rules are generated.

4 Conclusions and Future Work 

We have proposed a new method of applying the association rule algorithms 
for recommendation systems. Unlike most current recommendation systems, our 
method does not consider score or vote information associated with every recom- 
mended item. By applying appropriate rule templates, we achieved interesting 
rules. Experimental results show that rule templates increase the accuracy of 
recommendations. As future work, we would like to explore the performance of 
different interestingness measures on recommendation systems.

Abstract. A novel incremental feature extraction and classification
system is proposed. Kernel PCA is a well-known nonlinear feature extraction
method. One problem of Kernel PCA is that the computation becomes
prohibitive when the data set is large. Another problem is that, in order
to update the eigenvectors with new data, the whole eigenspace
must be recomputed. The proposed feature extraction method overcomes
these problems by updating the eigenspace incrementally and by using an
empirical kernel map as the kernel function. The proposed feature extraction
method is more memory-efficient than Kernel PCA and can be
easily improved by re-learning the data. For classification, the extracted
features are used as input to a Least Squares SVM. In our experiments
we show that the proposed feature extraction method is comparable in
performance to Kernel PCA and that the proposed classification system shows
high classification performance on UCI benchmarking data and the NIST
handwritten data set.

Keywords: Incremental PCA, Kernel PCA, Empirical kernel map, LS-SVM

1 Introduction 

Many pattern recognition problems rely critically on efficient data
representation. It is therefore desirable to extract measurements that are
invariant or insensitive to the variations within each class. The process of
extracting such measurements is called feature extraction. Principal Component
Analysis (PCA) [1] is a powerful technique for extracting features from possibly
high-dimensional data sets. Reviews of the existing literature are given in
[2] [3] [4]. Traditional PCA, however, has several problems. First, PCA requires
a batch computation step, which causes a serious problem when the data set is
large, i.e., the PCA computation becomes very expensive. The second problem is
that, in order to update
* This study was supported by a grant of the Korea Health 21 R&D Project, Ministry 
of Health & Welfare, Republic of Korea (02-PJ1-PG6-HI03-0004)

the subspace of eigenvectors with new data, we have to recompute the whole
eigenspace. The final problem is that PCA only defines a linear projection of the
data, so the scope of its application is necessarily somewhat limited. It has been
shown that most real-world data are inherently non-symmetric and
therefore contain higher-order correlation information that could be useful [5].
PCA is incapable of representing such data. For such cases, a nonlinear
transform is necessary. Recently the kernel trick has been applied to PCA; it is
based on a formulation of PCA in terms of the dot product matrix instead of the
covariance matrix [8]. Kernel PCA (KPCA), however, requires storing and finding
the eigenvectors of an $N \times N$ kernel matrix, where $N$ is the number of
patterns. This is infeasible when $N$ is large. This fact has motivated the
development of an incremental KPCA method which does not store the kernel matrix.
It is hoped that the extracted features have a simple distribution in the
feature space so that a classifier can perform its task properly. However, it
has been pointed out that features extracted by KPCA are global features for all
input data and thus may not be optimal for discriminating one class from others [6].
This naturally motivates combining the feature extraction method with a
classifier for classification purposes. In this paper we propose a new classifier
for on-line and nonlinear data. The proposed classifier is composed of two parts.
The first part is used for feature extraction. To extract nonlinear features, we
propose a new feature extraction method which overcomes the memory requirement
problem of KPCA by combining an incremental eigenspace update method with
an adaptation of the kernel function. The second part is used for classification.
The extracted features are used as input for classification. We take Least Squares
Support Vector Machines (LS-SVM) [7] as the classifier. LS-SVM is a reformulation
of the standard Support Vector Machine (SVM) [8]. SVMs typically solve
problems by quadratic programming (QP). Solving a QP problem requires considerable
computational effort and more memory. LS-SVM overcomes
this problem by solving a set of linear equations in the problem formulation.
The paper is organized as follows. In Section 2 we briefly explain the
incremental eigenspace update method. In Section 3 KPCA is introduced and the
empirical kernel map, which makes KPCA incremental, is explained. The
proposed classifier combining LS-SVM with the proposed feature extraction method
is described in Section 4. Experimental results evaluating the performance of the
proposed classifier are shown in Section 5. A discussion of the proposed
classifier and future work is given in Section 6.



2 Incremental Eigenspace Update Method 

In this section, we give a brief introduction to the incremental
PCA algorithm, which overcomes the computational complexity and memory
requirements of standard PCA. Before continuing, a note on notation is in order.
Vectors are columns, and the size of a vector, or matrix, where it is important, is
denoted with subscripts. Particular column vectors within a matrix are denoted
with a superscript, while a superscript on a vector denotes a particular
observation from a set of observations, so we treat observations as column vectors
of a matrix. As an example, $A^{i}_{mn}$ is the $i$th column vector in an
$(m \times n)$ matrix. We denote a column extension to a matrix using square
brackets. Thus $[A_{mn}\, b]$ is an $(m \times (n+1))$ matrix, with vector $b$
appended to $A_{mn}$ as a last column.

To explain the incremental PCA, we assume that we have already built a set
of eigenvectors $U = [u_j],\ j = 1, \ldots, k$ after having trained on the input
data $x_i,\ i = 1, \ldots, N$. The corresponding eigenvalues are $\Lambda$ and
$\bar{x}$ is the mean of the input vectors. Incremental building of the
eigenspace requires updating this eigenspace to take into account a new input
$x_{N+1}$. Here we give a brief summary of the method described in [9]. First,
we update the mean:

$$\bar{x}' = \frac{1}{N+1}\left(N\bar{x} + x_{N+1}\right) \qquad (1)$$

We then update the set of eigenvectors to reflect the new input vector and
apply a rotational transformation to $U$. To do this, it is necessary to
compute the orthogonal residual vector
$h_{N+1} = (U a_{N+1} + \bar{x}) - x_{N+1}$ and normalize it to obtain
$\hat{h}_{N+1} = h_{N+1} / \|h_{N+1}\|_2$ for $\|h_{N+1}\|_2 > 0$ and
$\hat{h}_{N+1} = 0$ otherwise. We obtain the new matrix of eigenvectors $U'$ by
appending $\hat{h}_{N+1}$ to the eigenvectors $U$ and rotating them:

$$U' = [U, \hat{h}_{N+1}]\, R \qquad (2)$$

where $R \in \mathbb{R}^{(k+1)\times(k+1)}$ is a rotation matrix. $R$ is the
solution of an eigenproblem of the following form:

$$D R = R \Lambda' \qquad (3)$$

where $\Lambda'$ is a diagonal matrix of the new eigenvalues. We compose
$D \in \mathbb{R}^{(k+1)\times(k+1)}$ as:

$$D = \frac{N}{N+1}\begin{bmatrix}\Lambda & 0\\ 0^{T} & 0\end{bmatrix}
    + \frac{N}{(N+1)^2}\begin{bmatrix} a a^{T} & \gamma a\\ \gamma a^{T} & \gamma^2\end{bmatrix} \qquad (4)$$

where $\gamma = \hat{h}_{N+1}^{T}(x_{N+1} - \bar{x})$ and
$a = U^{T}(x_{N+1} - \bar{x})$. Though there are other ways to construct the
matrix $D$, only the method described in [9] allows for updating of the mean.
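
One update step built from equations (1) to (4) can be sketched in NumPy as
follows. This is a minimal illustration rather than the implementation used in
the paper: the function names are hypothetical, the projection of the new
sample is taken to be $a = U^{T}(x_{N+1} - \bar{x})$, and, instead of the
retention criterion discussed in Section 2.1, every direction whose eigenvalue
exceeds a small numerical tolerance `tol` is kept.

```python
import numpy as np


def batch_pca(X, tol=1e-10):
    """Reference batch PCA. X has shape (N, d); rows are observations."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]          # covariance with 1/N scaling
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]        # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > tol
    return mean, vecs[:, keep], vals[keep]


def incremental_update(mean, U, lam, n, x_new, tol=1e-10):
    """Fold one new sample into an eigenspace built from n samples."""
    a = U.T @ (x_new - mean)              # projection of the new sample
    h = (U @ a + mean) - x_new            # orthogonal residual vector
    h_norm = np.linalg.norm(h)
    h_hat = h / h_norm if h_norm > tol else np.zeros_like(h)
    gamma = h_hat @ (x_new - mean)

    k = U.shape[1]
    D = np.zeros((k + 1, k + 1))          # equation (4)
    D[:k, :k] = np.diag(lam)
    D *= n / (n + 1)
    v = np.append(a, gamma)
    D += n / (n + 1) ** 2 * np.outer(v, v)

    new_lam, R = np.linalg.eigh(D)        # equation (3): D R = R Lambda'
    order = np.argsort(new_lam)[::-1]
    new_lam, R = new_lam[order], R[:, order]
    new_U = np.hstack([U, h_hat[:, None]]) @ R   # equation (2)
    new_mean = (n * mean + x_new) / (n + 1)      # equation (1)

    keep = new_lam > tol                  # assumption: tolerance-based pruning
    return new_mean, new_U[:, keep], new_lam[keep], n + 1
```

When no direction is discarded, repeated calls reproduce the eigenvalues and
eigenvectors of a batch PCA on the same data up to rounding and sign, which is
a convenient sanity check for the update.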



2.1 Eigenspace Updating Criterion 

The incremental PCA represents the input data with principal components
$a_{i(N)}$, and each input can be approximated as follows:

$$x_{i(N)} \approx U a_{i(N)} + \bar{x} \qquad (5)$$

To update the principal components $a_{i(N)}$ for a new input $x_{N+1}$, it is
necessary to compute an auxiliary vector $\eta$, which is calculated as follows:

$$\eta = [U, \hat{h}_{N+1}]^{T}(\bar{x} - \bar{x}') \qquad (6)$$

Then the computation of all principal components is

$$a_{i(N+1)} = R^{T}\left(\begin{bmatrix} a_{i(N)} \\ 0 \end{bmatrix} + \eta\right) \qquad (7)$$

The above transformation produces a representation with $k+1$ dimensions.
Because the dimensionality increases by one, however, more storage is required
to represent the data. If we try to keep a $k$-dimensional eigenspace, we lose a
certain amount of information. We therefore need a criterion for the number of
eigenvectors to retain. There is no explicit guideline for this choice.
Here we introduce some general criteria to deal with the model's
dimensionality:

— Adding a new vector whenever the size of the residual vector exceeds an 
absolute threshold; 

— Adding a new vector when the percentage of energy carried by the last 
Eigenvalue in the total energy of the system exceeds an absolute threshold, 
or equivalently, defining a percentage of the total energy of the system that 
will be kept in each update; 

— Discarding Eigenvectors whose Eigenvalues are smaller than a percentage of 
the first Eigenvalue; 

— Keeping the dimensionality constant. 

In this paper we adopt the second criterion above. We add an eigenvector when
$\lambda' > 0.7\bar{\lambda}$, where $\bar{\lambda}$ is the mean of the
eigenvalues $\lambda$. Based on this rule, we decide whether or not to add the
new eigenvector.



3 Incremental KPCA 

A prerequisite of the incremental eigenspace update method is that it has to
be applied directly to the data set. Furthermore, because incremental PCA builds
the subspace of eigenvectors incrementally from the data themselves, it is
restricted to linear data. In the case of KPCA, however, the mapped data set
$\Phi(x_N)$ is high-dimensional and most of the time cannot even be calculated
explicitly. For nonlinear data sets, applying a feature mapping function to
incremental PCA may be one solution. This is performed by the so-called kernel
trick, which means an implicit embedding into an infinite-dimensional Hilbert
space [8] (i.e., feature space) $F$:

$$K(x, y) = \Phi(x) \cdot \Phi(y) \qquad (8)$$

where $K$ is a given kernel function in the input space. When $K$ is positive
semi-definite, the existence of $\Phi$ is proven [8]. In most cases, however,
the mapping $\Phi$ is high-dimensional and cannot be obtained explicitly. The
vector in the feature space is not observable, and only the inner product
between vectors can be observed via the kernel function. However, for a given
data set, it is possible to approximate $\Phi$ by the empirical kernel map
proposed by Schölkopf [12] and Tsuda [13], which maps the input space to
$\mathbb{R}^N$ and is defined as

$$\Psi_N(x) = [\Phi(x_1) \cdot \Phi(x), \ldots, \Phi(x_N) \cdot \Phi(x)]^{T}
            = [K(x_1, x), \ldots, K(x_N, x)]^{T} \qquad (9)$$

A performance evaluation of the empirical kernel map was given by Tsuda, who
shows that a support vector machine with an empirical kernel map is identical to
one with the conventional kernel map [13]. The empirically mapped data
$\Psi_N(x_N)$, however, do not form an orthonormal basis in $\mathbb{R}^N$, so
the dot product in this space is not the ordinary dot product. In the case of
KPCA, however, this can be ignored by the following argument. The idea is that
we have to perform linear PCA on the $\Psi_N(x_N)$ from the empirical kernel map
and thus diagonalize its covariance matrix. Let the $N \times N$ matrix
$\Psi = [\Psi_N(x_1), \Psi_N(x_2), \ldots, \Psi_N(x_N)]$; then from
equation (9) and the definition of the kernel matrix we can construct
$\Psi = NK$. The covariance matrix of the empirically mapped data is:

$$C_{\Psi} = \frac{1}{N}\Psi\Psi^{T} = NKK^{T} = NK^{2} \qquad (10)$$

In the case of the empirical kernel map, we diagonalize $NK^{2}$ instead of $K$
as in KPCA. Mika shows that the two matrices have the same eigenvectors
$\{u_k\}$ [14]. The eigenvalues $\{\lambda_k\}$ of $K$ are related to the
eigenvalues $\{k_k\}$ of $NK^{2}$ by

$$\lambda_k = \sqrt{\frac{k_k}{N}} \qquad (11)$$

and as before we can normalize the eigenvectors $\{v_k\}$ for the covariance
matrix $C$ of the data by dividing each $u_k$ by $\sqrt{\lambda_k N}$. Instead
of actually diagonalizing the covariance matrix $C_{\Psi}$, IKPCA is applied
directly to the mapped data $\Psi = NK$. This makes it easy to adapt the
incremental eigenspace update method to KPCA in such a way that it also
correctly takes into account the centering of the mapped data in an incremental
way. As a result, we only need to apply the empirical map to one data point at a
time and do not need to store the $N \times N$ kernel matrix.
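
The empirical kernel map of equation (9) is straightforward to compute one
sample at a time. The sketch below is illustrative only: the RBF kernel, the
bandwidth parameter `sigma` and the function names are assumptions, since this
section does not fix a particular kernel.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel K(x, y); the bandwidth sigma is an assumed parameter."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def empirical_kernel_map(x, X_train, sigma=1.0):
    """Equation (9): map x to the vector [K(x_1, x), ..., K(x_N, x)]."""
    return np.array([rbf_kernel(x_i, x, sigma) for x_i in X_train])


def kernel_matrix(X_train, sigma=1.0):
    """Full N x N kernel matrix, used here only to check the eigenvalue relation."""
    return np.array([[rbf_kernel(a, b, sigma) for b in X_train] for a in X_train])
```

Each call to `empirical_kernel_map` produces a single $N$-dimensional vector
that can be passed to an incremental eigenspace update, so the full kernel
matrix is only needed in a check such as comparing the eigenvalues of $K$ with
those of $NK^{2}$.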



4 Proposed Classification System 

In Section 3 we proposed an incremental KPCA method for nonlinear
feature extraction. Feature extraction by incremental KPCA effectively acts as a
nonlinear mapping from the input space to an implicit high-dimensional feature
space. It is hoped that the mapped data have a simple distribution in the
feature space so that a classifier can classify them properly. However, it has
been pointed out that features extracted by KPCA are global features for all
input data and thus may not be optimal for discriminating one class from others.
For classification purposes, after the global features are extracted, they must
be used as input data for a classifier. There are many well-known classifiers in
the machine learning field. Among them, neural networks are a popular method for
classification and prediction. Traditional neural network approaches, however,
have suffered difficulties with generalization, producing models that can
overfit the data. To overcome this problem of classical neural network
techniques, support vector machines (SVM) have been introduced. The foundations
of SVM were developed by Vapnik, and it is a powerful methodology for solving
problems in nonlinear classification. Originally, it was introduced within the
context of statistical learning theory and structural risk minimization. In
these methods one solves convex optimization problems, typically by quadratic
programming (QP). Solving a QP problem requires considerable computational
effort and more memory. LS-SVM overcomes this problem by solving a set of
linear equations in the problem formulation. The LS-SVM method is
computationally attractive and easier to extend than SVM.
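
To make the "set of linear equations" concrete, the sketch below trains a
binary LS-SVM classifier by solving the standard LS-SVM linear system from the
LS-SVM literature (Suykens and Vandewalle). It is not taken from this paper:
the RBF kernel, the regularization parameter `gamma`, the bandwidth `sigma`
and the function names are assumptions chosen for illustration.

```python
import numpy as np


def rbf_gram(A, B, sigma):
    """Pairwise RBF kernel values between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1].

    X has shape (n, d); y holds labels in {-1, +1}.
    """
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_gram(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)    # one linear solve, no QP
    return solution[0], solution[1:]      # bias b, multipliers alpha


def lssvm_predict(X_train, y, alpha, b, X_new, sigma=1.0):
    """Classify by the sign of sum_i alpha_i y_i K(x, x_i) + b."""
    scores = rbf_gram(X_new, X_train, sigma) @ (alpha * y) + b
    return np.where(scores >= 0.0, 1.0, -1.0)
```

In the proposed system the rows of `X` would be the features extracted by
IKPCA; the multiclass setting used later in the experiments would require an
additional coding scheme on top of this binary solver.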



5 Experiment 

To evaluate the performance of the proposed classification system, the
experiments are performed in the following steps. First we evaluate the feature
extraction ability of incremental KPCA (IKPCA). The disadvantage of incremental
methods is their accuracy compared to batch methods, even though they have the
advantage of memory efficiency. We therefore apply the proposed method to simple
toy data and an image data set to show the accuracy and memory efficiency of
incremental KPCA compared to the APEX model proposed by Kung [15] and to batch
KPCA. Next we evaluate the training and generalization ability of the proposed
classifier on UCI benchmarking data and the NIST handwritten data set. To do
this, the features extracted by IKPCA are used as input for LS-SVM.



5.1 Toy Data 

To evaluate the feature extraction accuracy and memory efficiency of IKPCA
compared to APEX and KPCA, we take the nonlinear data used by Schölkopf [5].
In total, 41 training data points are generated by:

$$y = x^{2} + 0.2\varepsilon, \quad \varepsilon \sim N(0, 1), \quad x \in [-1, 1] \qquad (12)$$

First we compare the feature extraction ability of IKPCA with that of the APEX
model. The APEX model is a well-known principal component extractor based on the
Hebbian learning rule. Applying the toy data to IKPCA, we finally obtain 2
eigenvectors. To evaluate the performance of the two methods under the same
conditions, we set 2 output nodes for the standard APEX model.
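
The toy data of equation (12) can be generated as in the sketch below. The
text does not say how the 41 values of $x$ are chosen, so the evenly spaced
grid on $[-1, 1]$, the random seed and the function name are assumptions.

```python
import numpy as np


def make_toy_data(n=41, noise=0.2, seed=0):
    """Return an (n, 2) array of points (x, y) with y = x^2 + noise * eps."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)           # assumption: evenly spaced x values
    eps = rng.standard_normal(n)             # eps drawn from N(0, 1)
    y = x ** 2 + noise * eps
    return np.column_stack([x, y])
```

With `n=41` the grid spacing is 0.05, and setting `noise=0.0` recovers the
underlying parabola exactly.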

In Table 1 we evaluate the APEX method under various conditions. Generally,
neural-network-based learning models have difficulty in determining parameters
such as the learning rate, the initial weight values and the optimal number of
hidden layer nodes. This leads us to conduct experiments under various
conditions. $\|w\|$ is the norm of the weight vector in APEX, and $\|w\| = 1$
means that it converges to a stable minimum. $\cos\theta$ is the cosine of the
angle between the eigenvector of KPCA and that of APEX or IKPCA, respectively.
The $\cos\theta$ of the eigenvectors can serve as a measure of how close the
accuracy of IKPCA and APEX is to that of KPCA. Table 1 shows the two advantages
of IKPCA compared to APEX: first, the performance of IKPCA is better than that
of APEX;
second, the performance of IKPCA is easily improved by re-learning.

Table 1. Performance evaluation of IKPCA and APEX

Method  Iteration  Learning Rate  ||w_1||  ||w_2||  cos θ_1  cos θ_2  MSE
APEX    50         0.01           0.6827   1.4346   0.9993   0.7084   14.8589
APEX    50         0.05           do not converge
APEX    500        0.01           1.0068   1.0014   0.9995   0.9970   4.4403
APEX    500        0.05           1.0152   1.0470   0.9861   0.9432   4.6340
APEX    1000       0.01           1.0068   1.0014   0.9995   0.9970   4.4403
APEX    1000       0.05           1.0152   1.0470   0.9861   0.9432   4.6340
IKPCA   100        -              1        1        1        1        0.0223

Another factor for evaluating accuracy is the reconstruction error. The
reconstruction error is defined as the squared distance between the image of
$x_n$ and its reconstruction when projected onto the first $i$ principal
components:

$$\delta = \|\Phi(x_n) - P_i \Phi(x_n)\|^{2} \qquad (13)$$

Here $P_i$ denotes the projection onto the first $i$ principal components. The
MSE (mean squared error) of the reconstruction is 4.4403 for APEX, whereas it is
0.0223 for IKPCA. This means that the accuracy of IKPCA is superior to that of
standard APEX and similar to that of batch KPCA. These results on the simple toy
problem indicate that IKPCA is comparable to batch KPCA and superior to APEX in
terms of accuracy.

Next we compare the memory efficiency of IKPCA with that of KPCA.
To extract nonlinear features, IKPCA only needs the $D$ matrix and the $R$
matrix, whereas KPCA needs the kernel matrix. Table 2 shows the memory
requirement of each method. The memory requirement of standard KPCA is about 93
times that of IKPCA. This simple toy problem thus shows that IKPCA has an
ability to extract nonlinear features similar to that of KPCA while being more
efficient in memory requirement.



Table 2. Memory efficiency of IKPCA compared to KPCA on toy data

                  KPCA      IKPCA
Kernel matrix     41 x 41   none
R matrix          none      3 x 3
D matrix          none      3 x 3
Efficiency ratio  93.3889   1
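
The efficiency ratio in Table 2 follows from counting stored matrix entries:
KPCA keeps a $41 \times 41$ kernel matrix (1681 entries), while IKPCA keeps two
$3 \times 3$ matrices (18 entries), and $1681 / 18 \approx 93.3889$. The small
sketch below reproduces this arithmetic; the function name is hypothetical and
the count ignores any other storage either method might need.

```python
def kpca_to_ikpca_memory_ratio(n_samples: int, k: int) -> float:
    """Ratio of kernel-matrix entries to the entries of the D and R matrices.

    KPCA stores an n_samples x n_samples kernel matrix; IKPCA stores two
    (k + 1) x (k + 1) matrices, D and R, where k is the number of retained
    eigenvectors.
    """
    kernel_entries = n_samples * n_samples
    ikpca_entries = 2 * (k + 1) ** 2
    return kernel_entries / ikpca_entries
```

For the toy problem, `kpca_to_ikpca_memory_ratio(41, 2)` gives 1681 / 18.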



5.2 Reconstruction Ability 

To compare the reconstruction ability of the incremental eigenspace update
method proposed by Hall with that of the APEX model, we conducted an experiment
on the US National Institute of Standards and Technology (NIST) handwritten data
set. The data have been size-normalized to 16 x 16 images with their values
scaled to the interval [0,1]. Applying these data to the incremental eigenspace
update method, we finally obtain 6 eigenvectors. As in the earlier experiment,
we set 6 output nodes for the standard APEX method. Figure 1 shows the original
data and their reconstructed images by the incremental eigenspace update method
and APEX respectively. We can see that the images reconstructed by the
incremental eigenspace update method are clearer and more similar to the
original images than those from the APEX method.

Fig. 1. Reconstructed image by IKPCA and APEX

5.3 UCI Machine Learning Repository 

To test the performance of the proposed classifier on real-world data, we extend
our experiments to the Cleveland heart disease data and the wine data obtained
from the UCI Machine Learning Repository. A detailed description of the data is
available from the web site (http://www.ics.uci.edu/~mlearn/MLSummary.html).
In this problem we randomly split the data, using 80% for training and the
remainder as test data. An RBF kernel was used, with the optimal hyperparameters
selected by a 10-fold cross-validation procedure. Table 3 shows the learning and
generalization ability of the proposed classifier.



Table 3. Training and generalization result by proposed classifier on UCI Machine
Learning Repository

                         Training  Generalization  Eigenvalue update criterion
Cleveland heart-disease  100%      98.69%          λ' > 0.7 λ̄
Wine data                100%      98.62%          λ' > 0.7 λ̄



These results show that the proposed classification system classifies well
on these data sets.

5.4 NIST Handwritten Data Set

To validate the above results on a widely used pattern recognition benchmark
database, we conducted a classification experiment on the NIST data set. This
database originally contains 15,025 digit images. For computational reasons, we
decided to use a subset of 2000 images, 1000 for training and 1000 for testing.
In this problem we use the multiclass LS-SVM classifier proposed by Suykens [16].
An important issue for SVM is model selection. In [17] it is shown that the use
of 10-fold cross-validation for hyperparameter selection of LS-SVMs consistently
leads to very good results. In this problem an RBF kernel was used, and the
hyperparameters $\gamma_1 = 1.5198$, $\gamma_2 = 179.731$, $\gamma_3 = 10.51$,
$\gamma_4 = 12.81$ and $\sigma_1 = 67.416$, $\sigma_2 = 656.351$,
$\sigma_3 = 54.349$, $\sigma_4 = 57.909$ were obtained by the 10-fold
cross-validation technique.



Table 4. Training and generalization results on NIST handwritten data

                      Training   Generalization   Eigenvalue update criterion
Proposed Classifier   100%       98.8%            $\lambda' > 0.7\bar{\lambda}$



Table 5. Misclassification frequency of the proposed classification system on
test data

Pattern     0   1   2   3   4   5   6   7   8   9   Total
Frequency   0   0   0   3   4   0   0   5   0   0   12



The results on the NIST data are given in Tables 4 and 5. The 12 misclassified
patterns among the 1000 test images correspond to the 98.8% generalization rate
in Table 4, so the proposed classification system also classifies this widely
used pattern recognition benchmark well.

6 Conclusion and Remarks 

This paper presents a new technique for extracting nonlinear features from
incremental data, together with a classification system built on these
features. To develop this technique, we apply an incremental eigenspace update
method to KPCA with an empirical kernel map approach. The proposed IKPCA has
the following advantages. Firstly, for incremental and nonlinear data, IKPCA
has feature extraction performance comparable to that of batch KPCA. Secondly,
IKPCA is more efficient in memory requirement than batch KPCA. In batch KPCA
the $N \times N$ kernel matrix has to be stored, while for IKPCA the
requirements are $O((k+1)^2)$. Here $k$ ($1 \le k \le N$) is the number of
eigenvectors stored in each eigenspace updating step, which is usually much
smaller than $N$. Thirdly, IKPCA allows for complete incremental learning using
the eigenspace approach, whereas batch KPCA recomputes the whole decomposition
to update the subspace of eigenvectors with new data. Finally, experimental
results show that features extracted by IKPCA lead to good performance when
used as pre-processed input data for an LS-SVM.

A Metric Approach to Building Decision Trees
Based on Goodman-Kruskal Association Index



Abstract. We introduce a numerical measure on sets of partitions
of finite sets that is linked to the Goodman-Kruskal association index
commonly used in statistics. This measure allows us to define a metric
on such partitions used for constructing decision trees. Experimental
results suggest that by replacing the usual splitting criterion used in
C4.5 by a metric criterion based on the Goodman-Kruskal coefficient
it is possible, in most cases, to obtain smaller decision trees without
sacrificing accuracy.

Keywords: Goodman-Kruskal association index, metric, partition, decision tree

1 Introduction 

The construction of decision trees is centered around the algorithm that selects
an attribute that generates a partition of the subset of the training data set
that is located in the node about to be split. Over the years, several greedy
techniques for choosing the splitting attribute have been proposed, including
the entropy gain and the gain ratio [1], the Gini index [2], the
Kolmogorov-Smirnov metric [3,4], and a metric derived from Shannon entropy [5].
In our previous work [6] we extended the metric splitting criterion introduced
by L. de Mantaras by introducing metrics on the set of partitions of a finite
set constructed by using generalized conditional entropy (which corresponds to
a generalization of entropy introduced by Daroczy [7]). This paper introduces a
different type of metric on partitions of finite sets that is generated by a
coefficient derived from the Goodman-Kruskal association index and shows that
this metric can be applied successfully to the construction of decision trees.

The purpose of this note is to define a metric on the set of partitions of
a finite set that is derived from the Goodman-Kruskal association index. A
general framework of classification can be formulated starting with two finite
random variables $X$, with values $a_1, \ldots, a_l$, and $Y$, with values
$b_1, \ldots, b_k$.

We assume that we deal with a finite probability space where the elementary
events are pairs of values $(a_i, b_j)$, where $a_i$ is a value of $X$ and
$b_j$ is a value of $Y$. The classification rule adopted here is that an
elementary event is classified in the class that has the maximal probability.
Thus, in the absence of any knowledge about $X$, an elementary event will be
classified in the $Y$-class $b_j$ if $b_j$ corresponds to the highest value
among the probabilities $P(Y = b_j)$ for $1 \le j \le k$. If
$P(Y = b_j \mid X = a_i)$ is the probability of predicting the value $b_j$ for
$Y$ when $X = a_i$, then an event that has the component $X = a_i$ will be
classified in the $Y$-class $b_j$ if $j$ is the number for which
$P(Y = b_j \mid X = a_i)$ has the largest value. The probability of
misclassification committed by applying this rule is
$1 - \max_{1 \le j \le k} P(Y = b_j \mid X = a_i)$.

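The original Goodman-Kruskal association index $\lambda_{Y|X}$ (see [8,9]) is
the relative reduction in the probability of prediction error:

\[
\lambda_{Y|X} = 1 - \frac{GK(X,Y)}{1 - \max_{1 \le j \le k} P(Y = b_j)}
= \frac{\sum_{i=1}^{l} P(X = a_i) \max_{1 \le j \le k} P(Y = b_j \mid X = a_i)
        - \max_{1 \le j \le k} P(Y = b_j)}
       {1 - \max_{1 \le j \le k} P(Y = b_j)}.
\]

In other words, $\lambda_{Y|X}$ is the proportion of the relative error in
predicting the value of $Y$ that can be eliminated by knowledge of the
$X$-value.

The Goodman-Kruskal coefficient of $X$ and $Y$ that we use is defined by:

\[
GK(X,Y) = \sum_{i=1}^{l} P(X = a_i)
          \left(1 - \max_{1 \le j \le k} P(Y = b_j \mid X = a_i)\right)
        = 1 - \sum_{i=1}^{l} P(X = a_i) \max_{1 \le j \le k} P(Y = b_j \mid X = a_i).
\]

Thus, $GK(X,Y)$ is the expected value of the probability of misclassification.
This coefficient is related to $\lambda_{Y|X}$ by:

\[
GK(X,Y) = \left(1 - \lambda_{Y|X}\right)
          \left(1 - \max_{1 \le j \le k} P(Y = b_j)\right).
\]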
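As an illustration only, the two quantities can be computed from a joint
probability table. The sketch below assumes that GK(X,Y) is one minus the sum
over the values of X of the largest joint probability in each row, and that
the index is one minus GK(X,Y) divided by the baseline error; the function
names and the example table are hypothetical.

```python
from fractions import Fraction


def gk_coefficient(joint):
    """Expected misclassification probability GK(X, Y).

    joint[i][j] holds P(X = a_i, Y = b_j); rows index X, columns index Y.
    """
    # P(X = a_i) * max_j P(Y = b_j | X = a_i) equals max_j P(X = a_i, Y = b_j).
    return 1 - sum(max(row) for row in joint)


def gk_lambda(joint):
    """Goodman-Kruskal index: relative reduction in prediction error."""
    marginal_y = [sum(column) for column in zip(*joint)]
    baseline_error = 1 - max(marginal_y)  # error when X is unknown
    return 1 - gk_coefficient(joint) / baseline_error


# Hypothetical joint distribution of two binary variables.
joint = [
    [Fraction(3, 10), Fraction(1, 10)],
    [Fraction(1, 10), Fraction(5, 10)],
]
coefficient = gk_coefficient(joint)
index = gk_lambda(joint)
baseline = 1 - max(sum(column) for column in zip(*joint))
# The relation between the coefficient and the index holds exactly here.
identity_holds = coefficient == (1 - index) * baseline
```

Exact fractions are used so that the relation between the coefficient and the
index can be checked without rounding error.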

Next, we formulate a definition of the Goodman-Kruskal coefficient GK within an 
algebraic setting, using partitions of finite sets. The advantage of this formulation 
is the possibility of using lattices of partitions of finite sets and various operations 
on partitions. 

A partition of a set $S$ is a collection of nonempty subsets of $S$,
$\pi = \{B_i \mid i \in I\}$, such that $B_i \cap B_j = \emptyset$ for every
$i, j \in I$ such that $i \ne j$ and $\bigcup_{i \in I} B_i = S$. Note that a
partition $\pi = \{B_1, \ldots, B_l\}$ of a finite set $S$ generates a finite
random variable:

\[
X : \begin{pmatrix} 1 & \cdots & l \\ p_1 & \cdots & p_l \end{pmatrix},
\]

where $p_i = |B_i| / |S|$ for $1 \le i \le l$, and thus, the Goodman-Kruskal
coefficient can be formulated in terms of partitions of finite sets.

If $\pi, \sigma \in PART(S)$ we write $\pi \le \sigma$ if each block of $\pi$
is a subset of a block of $\sigma$. This is equivalent to saying that every
block of $\sigma$ is a union of blocks of $\pi$. We obtain a partially ordered
set $(PART(S), \le)$. The least partition of $S$ is the unit partition
$\iota_S = \{\{a\} \mid a \in S\}$; the largest partition is the one-block
partition $\omega_S = \{S\}$. The partially ordered set $(PART(S), \le)$ is a
semi-modular lattice (see [10]), where $\inf\{\pi, \sigma\}$ is the partition
$\pi \wedge \sigma$ whose blocks consist of the nonempty intersections
$B \cap C$, where $B \in \pi$ and $C \in \sigma$. Note that $\pi$ is covered by
$\sigma$ (that is, $\pi < \sigma$ and there is no $\theta \in PART(S)$ such
that $\pi < \theta < \sigma$) if and only if $\sigma$ is obtained from $\pi$ by
fusing together two blocks of $\pi$.

The trace of a partition $\pi = \{B_1, \ldots, B_k\}$ from $PART(S)$ on a
subset $R$ of $S$ is the partition $\pi_R \in PART(R)$ given by
$\pi_R = \{B_1 \cap R, \ldots, B_k \cap R\}$.

If $S, T$ are two disjoint sets and $\pi \in PART(S)$, $\sigma \in PART(T)$,
then we denote by $\pi + \sigma$ the partition of $S \cup T$ that consists of
all the blocks of $\pi$ and $\sigma$. It is easy to see that "+" is an
associative partial operation.
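The partition operations described here (the order, the infimum, the trace and
the sum) can be sketched in a few lines. This is only an illustration: the
representation of a partition as a set of frozensets, the helper names, and
the choice to drop empty intersections from a trace are assumptions made for
the example.

```python
def part(*blocks):
    """Build a partition as a set of frozenset blocks."""
    return {frozenset(block) for block in blocks}


def refines(pi, sigma):
    """pi <= sigma: every block of pi lies inside some block of sigma."""
    return all(any(b <= c for c in sigma) for b in pi)


def meet(pi, sigma):
    """Infimum of pi and sigma: the nonempty intersections of their blocks."""
    return {b & c for b in pi for c in sigma if b & c}


def trace(pi, subset):
    """Trace of pi on a subset R (empty intersections are dropped)."""
    subset = frozenset(subset)
    return {b & subset for b in pi if b & subset}


def partition_sum(pi, sigma):
    """pi + sigma for partitions of two disjoint sets."""
    return set(pi) | set(sigma)


pi = part([1, 2], [3, 4], [5, 6])
sigma = part([1, 2, 3], [4, 5, 6])
infimum = meet(pi, sigma)
left_trace = trace(pi, {1, 2, 3})
right_trace = trace(pi, {4, 5, 6})
```

In this example the infimum refines both arguments, and the sum of the two
traces of `pi` on the blocks of `sigma` gives the infimum back.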

Definition 1. Let $\pi = \{B_1, \ldots, B_k\}$ and $\sigma = \{C_1, \ldots, C_l\}$
be two partitions of a set $S$. The Goodman-Kruskal coefficient of $\pi$ and
$\sigma$ is the number:

\[
GK(\pi, \sigma) = 1 - \frac{1}{|S|} \sum_{i=1}^{k} \max_{1 \le j \le l} |B_i \cap C_j|.
\]
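A minimal sketch of the coefficient of two partitions, assuming it equals one
minus the sum, over the blocks of the first partition, of the largest overlap
with a block of the second partition, divided by the size of the set. The
function name and the example partitions are hypothetical.

```python
from fractions import Fraction


def gk(pi, sigma):
    """Goodman-Kruskal coefficient of two partitions of the same finite set."""
    size = sum(len(block) for block in pi)
    # For each block of pi, count the elements that fall in its best-matching
    # block of sigma.
    hits = sum(max(len(b & c) for c in sigma) for b in pi)
    return 1 - Fraction(hits, size)


pi = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
sigma = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
forward = gk(pi, sigma)
backward = gk(sigma, pi)
```

The two values differ in this example, which shows that the coefficient is not
symmetric in its arguments.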

Decision trees are built from data that has a tabular structure common in
relational databases. As is common in relational terminology (see [11], for
example), we regard a table as a triple $\tau = (T, H, \rho)$, where $T$ is a
string that gives the name of the table, $H = \{A_1, \ldots, A_n\}$ is a finite
set of symbols (called the attributes of $\tau$), and $\rho$ is a relation,
$\rho \subseteq Dom(A_1) \times \cdots \times Dom(A_n)$. Here $Dom(A_i)$ is the
domain of the attribute $A_i$ for $1 \le i \le n$.

A set of attributes $L \subseteq H$ determines a partition $\pi^L$ on the
relation $\rho$, that is, on the set of tuples of the table $\tau$, where two
tuples belong to the same block if they have equal projections on $L$. It is
easy to see that if $L, K$ are two sets of attributes, then
$\pi^{L \cup K} = \pi^L \wedge \pi^K$.
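The partition determined by a set of attributes can be sketched as follows,
assuming that tuples are grouped by their projection on the chosen attributes
and that the infimum of two partitions consists of the nonempty pairwise
intersections of their blocks. The table contents, attribute names and
function names are hypothetical; tuples are identified by their row index.

```python
from collections import defaultdict


def attribute_partition(rows, attrs):
    """Group row indices by their projection on the attribute set `attrs`."""
    groups = defaultdict(set)
    for index, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].add(index)
    return {frozenset(group) for group in groups.values()}


def meet(pi, sigma):
    """Infimum of two partitions: nonempty pairwise block intersections."""
    return {b & c for b in pi for c in sigma if b & c}


rows = [
    {"outlook": "sunny", "windy": False, "temp": "hot"},
    {"outlook": "sunny", "windy": True, "temp": "hot"},
    {"outlook": "rain", "windy": False, "temp": "mild"},
    {"outlook": "rain", "windy": False, "temp": "cool"},
]
by_outlook = attribute_partition(rows, ["outlook"])
by_windy = attribute_partition(rows, ["windy"])
by_both = attribute_partition(rows, ["outlook", "windy"])
```

On this small table, partitioning by both attributes at once gives the same
result as taking the infimum of the two single-attribute partitions.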

The classical technique for building decision trees uses the entropy gain
ratio as a criterion for choosing, for every internal node of the tree, the
splitting attribute that maximizes this ratio (see [1]). The construction has
an inductive character. If $\tau = (T, H, \rho)$ is the data set used to build
the decision tree $\mathcal{T}$, let $v$ be a node of $\mathcal{T}$ that is
about to be split and let $\rho_v$ be the set of tuples that corresponds to
$v$. Suppose that the target partition of the data set $\rho$ is $\theta$.
Then, the trace of this partition on $\rho_v$ is $\theta_{\rho_v}$.

Choosing the splitting attribute for a node $v$ of a decision tree
$\mathcal{T}$ for $\tau$ based on the minimal value of
$GK(\pi^A_{\rho_v}, \theta_{\rho_v})$ alone does not yield decision trees with
good accuracy. A lucid discussion of those issues can be found in [12,4].
However, we will show that the GK coefficient can be used to define a metric
on the set of partitions of a finite set that can be successfully used for
choosing splitting attributes. The decision trees that result are smaller and
have fewer leaves (and, therefore, less fragmentation) compared with trees
built by using the standard gain ratio criterion; also, they have comparable
accuracy.



2 The Goodman-Kruskal Metric Space 

The main result of this section is a construction of a metric $d_{GK}$ on the
set of partitions of a finite set that is related to the Goodman-Kruskal
coefficient and can be used for constructing decision trees. To introduce this
metric we need to establish several properties of GK. Unless we state
otherwise, all sets considered here are finite.

Theorem 1. Let $S$ be a set and let $\pi, \sigma \in PART(S)$. We have
$GK(\pi, \sigma) = 0$ if and only if $\pi \le \sigma$.

Proof. It is immediate that $\pi \le \sigma$ implies $GK(\pi, \sigma) = 0$.
Conversely, if $GK(\pi, \sigma) = 0$, then
$\sum_{i=1}^{k} \max_{1 \le j \le l} |B_i \cap C_j| = |S|$, which means that
for each block $B_i$ of $\pi$, there is a block $C_j$ such that
$|B_i \cap C_j| = |B_i|$. This is possible only if $B_i \subseteq C_j$, that
is, if $\pi \le \sigma$, which gives the desired conclusion.

Theorem 2. The function GK is monotonic in the first argument and dually
monotonic in the second argument.

Proof. To prove the first part of the statement let
$\pi = \{B_1, \ldots, B_k\}$, $\pi' = \{B'_1, \ldots, B'_n\}$, and
$\sigma = \{C_1, \ldots, C_l\}$ be three partitions of $S$ such that
$\pi \le \pi'$. Then, for every block $B'_h$ of $\pi'$ there is a collection of
blocks of $\pi$: $B_{i_1}, \ldots, B_{i_p}$ such that
$B'_h = B_{i_1} \cup \cdots \cup B_{i_p}$. Consequently, for every $m$,
$1 \le m \le l$, we can write:

\[
|B'_h \cap C_m| = |B_{i_1} \cap C_m| + \cdots + |B_{i_p} \cap C_m|
\le \max_{1 \le j \le l} |B_{i_1} \cap C_j| + \cdots + \max_{1 \le j \le l} |B_{i_p} \cap C_j|.
\]

Thus, we obtain:

\[
\max_{1 \le j \le l} |B'_h \cap C_j|
\le \max_{1 \le j \le l} |B_{i_1} \cap C_j| + \cdots + \max_{1 \le j \le l} |B_{i_p} \cap C_j|,
\]

which implies $GK(\pi, \sigma) \le GK(\pi', \sigma)$.

To prove the second part, let $\sigma, \sigma'$ be two partitions such that
$\sigma \le \sigma'$. We show that $GK(\pi, \sigma) \ge GK(\pi, \sigma')$. It
suffices to consider the case where $\sigma'$ covers $\sigma$, that is,
$\sigma = \{C_1, \ldots, C_{l-2}, C_{l-1}, C_l\}$ and
$\sigma' = \{C_1, \ldots, C_{l-2}, C_{l-1} \cup C_l\}$. In other words,
the blocks of $\sigma'$ coincide with the blocks of $\sigma$ with the exception
of one block that is obtained by fusing two blocks of $\sigma$. Note that for a
given block $B_i$ of $\pi$ we have:

\[
\max_{1 \le j \le l} |B_i \cap C_j|
\le \max\left\{ \max_{1 \le j \le l-2} |B_i \cap C_j|,\; |B_i \cap (C_{l-1} \cup C_l)| \right\},
\]

which implies $GK(\pi, \sigma) \ge GK(\pi, \sigma')$. □

The next result has a technical character:

Theorem 3. For every three partitions $\theta, \pi, \sigma$ of a finite set $S$
we have
$GK(\pi \wedge \theta, \sigma) + GK(\theta, \pi) \ge GK(\theta, \pi \wedge \sigma)$.

Proof. See Appendix A.

Fig. 1. Comparative Experimental Results

Theorem 4. Let $\theta, \pi, \sigma$ be partitions of a set $S$. We have
$GK(\theta, \pi) + GK(\pi, \sigma) \ge GK(\theta, \sigma)$.

Proof. Note that
\[
GK(\theta, \pi) + GK(\pi, \sigma) \ge GK(\theta, \pi) + GK(\pi \wedge \theta, \sigma)
\]
due to the monotonicity of GK in its first argument. By Theorem 3
\[
GK(\theta, \pi) + GK(\pi, \sigma) \ge GK(\theta, \pi \wedge \sigma) \ge GK(\theta, \sigma),
\]
because of the dual monotonicity of GK in its second argument.

Corollary 1. The mapping $d_{GK} : PART(S) \times PART(S) \to \mathbb{R}$ given by:
\[
d_{GK}(\pi, \sigma) = GK(\pi, \sigma) + GK(\sigma, \pi)
\]
for $\pi, \sigma \in PART(S)$, is a metric on the set $PART(S)$.

Proof. By Theorem 1 we have $d_{GK}(\pi, \sigma) = 0$ if and only if
$\pi = \sigma$. Also, the definition of $d_{GK}$ implies
$d_{GK}(\pi, \sigma) = d_{GK}(\sigma, \pi)$ for every $\pi, \sigma \in PART(S)$.

Finally, the triangular inequality
$d_{GK}(\pi, \sigma) + d_{GK}(\sigma, \theta) \ge d_{GK}(\pi, \theta)$ for
$\pi, \sigma, \theta \in PART(S)$ follows immediately from Theorem 4. □
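The metric properties can be checked exhaustively on a small set. The sketch
below is only a sanity check, not a proof: it assumes the distance is the sum
of the Goodman-Kruskal coefficient taken in both directions, enumerates all 15
partitions of a four-element set, and tests identity, symmetry and the
triangular inequality. All function names are hypothetical.

```python
from fractions import Fraction


def gk(pi, sigma):
    """Goodman-Kruskal coefficient of two partitions of the same set."""
    size = sum(len(block) for block in pi)
    hits = sum(max(len(b & c) for c in sigma) for b in pi)
    return 1 - Fraction(hits, size)


def d_gk(pi, sigma):
    """Symmetrized coefficient used as a distance between partitions."""
    return gk(pi, sigma) + gk(sigma, pi)


def all_partitions(items):
    """Yield every partition of the list `items` as a list of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [block | {first}] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield smaller + [frozenset([first])]


parts = list(all_partitions([0, 1, 2, 3]))
identity_and_symmetry = all(
    (d_gk(p, q) == 0) == (set(p) == set(q)) and d_gk(p, q) == d_gk(q, p)
    for p in parts for q in parts
)
triangle = all(
    d_gk(p, q) + d_gk(q, r) >= d_gk(p, r)
    for p in parts for q in parts for r in parts
)
```

Exact fractions avoid false alarms from floating-point rounding when two sides
of the triangular inequality are equal.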

3 The Goodman-Kruskal Splitting Criterion for Decision 
Trees 

Let $\tau = (T, H, \rho)$ be the table that contains the training data set
that is used to build a decision tree $\mathcal{T}$. Assume that we are about
to expand the node $v$ of the tree $\mathcal{T}$. Using the notations
introduced in Section 1, we choose to split the node $v$ using an attribute
$A$ that minimizes the distance $d_{GK}(\pi^A_{\rho_v}, \theta_{\rho_v})$.

The $d_{GK}$ metric does not favor attributes with large domains as splitting
attributes, an issue that is important for building decision trees.
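The selection step can be sketched as follows, assuming the distance is the
sum of the Goodman-Kruskal coefficient taken in both directions between the
partition generated by a candidate attribute and the target partition, both
restricted to the tuples of the node. This is not the modified J48 code used
in the experiments; the data, attribute names and function names are
hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def attribute_partition(rows, attrs):
    """Group row indices by their projection on the attribute set `attrs`."""
    groups = defaultdict(set)
    for index, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].add(index)
    return {frozenset(group) for group in groups.values()}


def gk(pi, sigma):
    """Goodman-Kruskal coefficient of two partitions of the same set."""
    size = sum(len(block) for block in pi)
    hits = sum(max(len(b & c) for c in sigma) for b in pi)
    return 1 - Fraction(hits, size)


def d_gk(pi, sigma):
    return gk(pi, sigma) + gk(sigma, pi)


def best_split(rows, attributes, target):
    """Return the attribute whose partition is closest to the target partition."""
    theta = attribute_partition(rows, [target])
    return min(attributes,
               key=lambda a: d_gk(attribute_partition(rows, [a]), theta))


# Hypothetical tuples located at the node about to be split.
node_rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rain", "windy": False, "play": "yes"},
    {"outlook": "rain", "windy": True, "play": "yes"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]
chosen = best_split(node_rows, ["outlook", "windy"], "play")
```

Here the partition generated by `outlook` refines the target partition, so one
of its two coefficient terms is zero and it ends up closer to the target than
the partition generated by `windy`.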

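Theorem 5. Let $S$ be a finite set and let $\pi, \pi', \sigma \in PART(S)$ be
such that $\pi' \le \pi$. If there exists a block $C$ of $\sigma$ and a block
$B$ of $\pi$ such that $B \subseteq C$, then
$d_{GK}(\pi, \sigma) \le d_{GK}(\pi', \sigma)$.

Proof. We can assume, without restricting generality, that $\pi'$ is covered by
$\pi$, that is, $\pi = \{B_1, \ldots, B_k\}$, $B = B_k$,
$\pi' = \{B_1, \ldots, B_{k-1}, B'_k, B''_k\}$, where $B_k = B'_k \cup B''_k$.
Also, let $\sigma = \{C_1, \ldots, C_l\}$, where $C_1 = C$.

Theorem 2 implies that $GK(\sigma, \pi') \ge GK(\sigma, \pi)$ (due to the dual
monotonicity in the second argument of GK). We prove that, under the
assumptions made in the theorem, we have $GK(\pi', \sigma) = GK(\pi, \sigma)$,
which implies the desired inequality. Indeed, note that:

\[
GK(\pi', \sigma)
= 1 - \frac{1}{|S|} \left( \sum_{i=1}^{k-1} \max_{1 \le j \le l} |B_i \cap C_j| + |B'_k| + |B''_k| \right)
= 1 - \frac{1}{|S|} \left( \sum_{i=1}^{k-1} \max_{1 \le j \le l} |B_i \cap C_j| + |B_k| \right)
= GK(\pi, \sigma),
\]

because $B'_k, B''_k, B_k \subseteq C$. □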

We note that Theorem 5 is similar to the property of the metric generated
by the Shannon entropy obtained by L. de Mantaras in [5] and generalized by
us in [6].

Table 1. Experimental Results: Entropy Gain Ratio vs. $d_{GK}$

                          Entropy Gain Ratio                   d_GK
Dataset                   acc     tree size  no. of leaves    acc     tree size  no. of leaves
anneal                    98.55   46.4       36.2             98.55   37.2       26.4
anneal.ORIG               90.20   63         42.6             86.30   28.2       17.8
audiology                 78.76   46         29               77.41   37.4       24
autos                     80      64.6       48.2             67.80   49.6       27.6
balance-scale             78.4    73.8       37.4             77.76   57         29
breast-cancer             73.09   21.2       17.2             73.78   18         13.4
wisc-breast-cancer        94.12   17.4       7.2              94.85   17         9
horse-colic               85.85   8.4        5.8              81.78   7.6        4.4
credit-rating             86.23   29.2       20.8             83.91   20.4       11.6
german-credit             72.9    108        77.6             69.5    63.4       36.8
pima-diabetes             75.65   42.6       21.8             70.96   88.6       44.8
Glass                     67.26   39.4       20.2             70.09   33.4       17.2
clev.-14-heart-disease    77.53   41.4       10.4             75.89   16.4       9
hung.-14-heart-disease    78.57   9.8        6.4              80.28   10         6.2
heart-statlog             75.55   26.6       13.8             71.85   17.4       9.2
hepatitis                 78.06   13.4       7.2              82.58   9          5
hypothyroid               99.46   25.8       13.4             99.39   21         11
ionosphere                89.73   25.8       13.4             88.89   16.2       8.6
iris                      95.33   8.2        4.6              95.33   6.6        3.8
kr-vs-kp                  99.15   51.8       27.4             98.46   76.4       39.8
labor                     78.63   6.8        4                84.09   3          2
lymphography              80.41   24.4       14.8             79.01   14.8       8.8
mushroom                  100     29.4       24.4             100     31.8       25
primary-tumor             40.99   77         41.2             43.64   38.8       21.4
segment                   97.09   81.8       41.4             94.02   67         34
sick                      98.75   42.6       23.6             98.35   18.2       10.8
sonar                     74.03   23.8       12.4             69.16   29.4       15.2
soybean                   91.21   89.4       58.4             90.19   105.2      71.2
splice                    94.04   199.6      160.8            93.51   194.4      156.6
vehicle                   72.10   117.8      59.4             65.60   128.2      64.6
vote                      96.55   11         6                94.71   4.6        2.8
vowel                     78.18   200.4      120.2            63.43   235        125.8
zoo                       93.09   14.6       7.8              93.09   14.6       7.8
average                   83.92   50.95      31.84            82.25   45.93      27.29


Next, we compare parameters of decision trees constructed on UCI machine
learning datasets [13] by using Entropy Gain Ratio and the Goodman-Kruskal
distance $d_{GK}$. The experiments have been conducted using the J48 (a variant
of C4.5) algorithm from the Weka Package [14], modified to use different
splitting criteria. The pruning steps of decision tree construction are left
unchanged. To verify the accuracy, we used 5-fold cross-validation. For each
splitting criterion we present three characteristics of the generated trees:
accuracy (percentage of correctly predicted cases), size of the tree (total
number of nodes) and the number of leaves in the tree. All are averaged over
the 5 folds of cross-validation.

Overall, $d_{GK}$ produced smaller trees for 24 out of 33 datasets considered.
In 4 cases (anneal.ORIG, clev.-14-heart-disease, sick, vote) over 50%
reduction was achieved. In one case (pima-diabetes) a sharp increase was
observed. On average the trees obtained were 10% smaller.

The accuracy of trees constructed using $d_{GK}$ was on average 1.67% worse
than that of trees constructed using the standard Weka version. In one case
(autos) the decrease was significant but for all other cases it was rather
moderate, and in a few cases $d_{GK}$ produced more accurate trees. Small tree
size is an advantage since, in general, small trees are much easier to
understand. The total number of nodes and the number of leaves in the trees
were highly correlated, so we can talk simply about the size of the tree.
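The summary figures quoted here can be reproduced from the "average" row of
Table 1. The short computation below is only a check of that arithmetic; it
assumes that the quoted 10% is the relative reduction of the average tree
size, which is not necessarily how the authors computed it, and the variable
names are hypothetical.

```python
# Averages copied from the last row of Table 1.
avg_gain_ratio = {"acc": 83.92, "size": 50.95, "leaves": 31.84}
avg_d_gk = {"acc": 82.25, "size": 45.93, "leaves": 27.29}

# Relative reduction in the average number of nodes and leaves.
size_reduction = (avg_gain_ratio["size"] - avg_d_gk["size"]) / avg_gain_ratio["size"]
leaf_reduction = (avg_gain_ratio["leaves"] - avg_d_gk["leaves"]) / avg_gain_ratio["leaves"]

# Absolute difference in average accuracy, in percentage points.
accuracy_drop = avg_gain_ratio["acc"] - avg_d_gk["acc"]

print(f"tree size reduction: {size_reduction:.1%}")
print(f"leaf reduction:      {leaf_reduction:.1%}")
print(f"accuracy drop:       {accuracy_drop:.2f} points")
```

Under that assumption the size reduction comes out just under 10% and the
accuracy difference is 1.67 percentage points, in line with the text.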

The best results obtained from experiments are also shown in Figure 1. 

Splitting nodes by using an attribute $A$ that minimizes
$GK(\pi^A_{\rho_v}, \theta_{\rho_v})$ instead of
$d_{GK}(\pi^A_{\rho_v}, \theta_{\rho_v})$ may result in a substantial loss of
accuracy. For example, in the case of the hungarian-14-heart-disease dataset,
the accuracy obtained using GK, under comparable conditions (averaging over
5-fold cross-validation), is just 70.05% compared to 78.57% obtained by using
the entropy gain ratio, or 80.28% obtained in the case of $d_{GK}$. This
confirms the claim in the literature of the unsuitability of using
$GK(\pi^A_{\rho_v}, \theta_{\rho_v})$ alone as a splitting criterion.



References 

1. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA (1993)

2. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and
Regression Trees. Chapman & Hall/CRC, Boca Raton (1984) Republished 1993.

3. Utgoff, P.E.: Decision tree induction based on efficient tree restructuring.
Technical Report 95-18, University of Massachusetts, Amherst (1995)

4. Utgoff, P.E., Clouse, J.A.: A Kolmogorov-Smirnoff metric for decision tree
induction. Technical Report 96-3, University of Massachusetts, Amherst (1996)

5. de Mantaras, R.L.: A distance-based attribute selection measure for decision
tree induction. Machine Learning 6 (1991) 81-92

6. Simovici, D.A., Jaroszewicz, S.: Generalized conditional entropy and decision
trees. In: Proceedings of EGC 2003, Lyon, France (2003) 369-380

7. Daróczy, Z.: Generalized information functions. Information and Control 16
(1970) 36-51

8. Goodman, L.A., Kruskal, W.H.: Measures of Association for Cross
Classifications. Volume 1. Springer-Verlag, New York (1980)

9. Liebetrau, A.M.: Measures of Association. SAGE, Beverly Hills, CA (1983)

10. Grätzer, G.: General Lattice Theory. Second edn. Birkhäuser, Basel (1998)

11. Simovici, D.A., Tenney, R.L.: Relational Database Systems. Academic Press,
New York (1995)

12. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and
Regression Trees. Chapman and Hall, Boca Raton (1998)

13. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases.
University of California, Irvine, Dept. of Information and Computer Sciences,
http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)

14. Witten, I.H., Frank, E.: Data Mining. Morgan Kaufmann, San Francisco (2000)

A Proof of Theorem 3 

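We begin by showing that if $S_1, \ldots, S_n$ are pairwise disjoint sets,
$S = S_1 \cup \cdots \cup S_n$, and $\pi_r, \sigma_r \in PART(S_r)$ for
$1 \le r \le n$, then

\[
GK(\pi_1 + \cdots + \pi_n, \sigma_1 + \cdots + \sigma_n)
= \sum_{r=1}^{n} \frac{|S_r|}{|S|} GK(\pi_r, \sigma_r).
\tag{1}
\]

Let $\pi_p = \{B^p_1, \ldots, B^p_{k_p}\}$ and
$\sigma_q = \{C^q_1, \ldots, C^q_{l_q}\}$ for $1 \le p, q \le n$. Then, we can
write:

\[
\begin{aligned}
GK(\pi_1 + \cdots + \pi_n, \sigma_1 + \cdots + \sigma_n)
&= 1 - \frac{1}{|S|} \sum_{p,i} \max_{q,j} |B^p_i \cap C^q_j| \\
&= 1 - \frac{1}{|S|} \sum_{p,i} \max_{j} |B^p_i \cap C^p_j|
   \quad \text{(because $p \ne q$ implies $B^p_i \cap C^q_j = \emptyset$)} \\
&= \sum_{p=1}^{n} \frac{|S_p|}{|S|}
   \left( 1 - \frac{1}{|S_p|} \sum_{i=1}^{k_p} \max_{1 \le j \le l_p} |B^p_i \cap C^p_j| \right) \\
&= \sum_{p=1}^{n} \frac{|S_p|}{|S|} GK(\pi_p, \sigma_p),
\end{aligned}
\]

which is the desired equality.

Let now $\mathcal{K}(\sigma)$ be the number:

\[
\mathcal{K}(\sigma) = GK(\omega_S, \sigma) = 1 - \frac{1}{|S|} \max_{1 \le j \le l} |C_j|.
\]

We claim that if $\pi, \sigma \in PART(S)$, then:

\[
GK(\pi, \sigma) \ge \mathcal{K}(\pi \wedge \sigma) - \mathcal{K}(\pi).
\tag{2}
\]

Let $\pi = \{B_1, \ldots, B_k\}$ and $\sigma = \{C_1, \ldots, C_l\}$. We can
write:

\[
\begin{aligned}
|S| \cdot GK(\pi, \sigma)
&= |S| - \sum_{i=1}^{k} \max_{1 \le j \le l} |B_i \cap C_j| \\
&= \sum_{i=1}^{k} \left( |B_i| - \max_{1 \le j \le l} |B_i \cap C_j| \right) \\
&\ge \max_{1 \le i \le k} |B_i| - \max_{1 \le i \le k,\, 1 \le j \le l} |B_i \cap C_j| \\
&= |S| \left( \mathcal{K}(\pi \wedge \sigma) - \mathcal{K}(\pi) \right),
\end{aligned}
\]

which proves the inequality (2).

Let $\pi = \{B_1, \ldots, B_k\}$, $\theta = \{D_1, \ldots, D_m\}$ and
$\sigma = \{C_1, \ldots, C_l\}$. We have:

\[
\pi \wedge \theta = \pi_{D_1} + \cdots + \pi_{D_m} = \theta_{B_1} + \cdots + \theta_{B_k}.
\]

Consequently, by Equality (1), we have:

\[
GK(\pi \wedge \theta, \sigma) = GK(\pi_{D_1} + \cdots + \pi_{D_m}, \sigma)
= \sum_{h=1}^{m} \frac{|D_h|}{|S|} GK(\pi_{D_h}, \sigma_{D_h}).
\]

Also, we have

\[
GK(\theta, \pi) = \sum_{h=1}^{m} \frac{|D_h|}{|S|} \mathcal{K}(\pi_{D_h}),
\]

which implies

\[
GK(\pi \wedge \theta, \sigma) + GK(\theta, \pi)
= \sum_{h=1}^{m} \frac{|D_h|}{|S|}
  \left( GK(\pi_{D_h}, \sigma_{D_h}) + \mathcal{K}(\pi_{D_h}) \right).
\]

The Inequality (2) implies:

\[
GK(\pi_{D_h}, \sigma_{D_h}) + \mathcal{K}(\pi_{D_h})
\ge \mathcal{K}(\pi_{D_h} \wedge \sigma_{D_h}) = \mathcal{K}((\pi \wedge \sigma)_{D_h}),
\]

so we may conclude that:

\[
GK(\pi \wedge \theta, \sigma) + GK(\theta, \pi)
\ge \sum_{h=1}^{m} \frac{|D_h|}{|S|} \mathcal{K}((\pi \wedge \sigma)_{D_h})
= GK(\theta, \pi \wedge \sigma).
\]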



DRC-BK: Mining Classification Rules with Help of SVM 



Yang Zhang, Zhanhuai Li, Yan Tang, and Kebin Cui

Abstract. Currently, the accuracy of the SVM classifier is very high, but its
classification model is not understandable by human experts. In this paper, we
use SVM, which is applied with a Boolean kernel, to construct a hyper-plane
for classification, and mine classification rules from this hyper-plane. In
this way, we build DRC-BK, a decision rule classifier. Experiment results show
that DRC-BK has a higher accuracy than some state-of-the-art decision rule
(decision tree) classifiers, such as C4.5, CBA, CMAR, CAEP and so on.

Keywords: Decision Rule Classifier, SVM, Boolean Kernel



1 Introduction 

Currently, the classification accuracy of the SVM classifier is very high.
However, its classification model is not understandable, whereas an
understandable model is very helpful in some applications, such as the
classification of diagnosis information. In this paper, we use SVM, which is
applied with a Boolean kernel, as a learning engine, and mine decision rules
from the hyper-plane constructed by SVM, so as to build DRC-BK (Decision Rule
Classifier based on Boolean Kernel), a decision rule classifier.

To our knowledge, research on making the classification model of SVM
classifiers understandable has not appeared in the literature. The
classifier-building algorithms and the classification algorithms of some
decision tree (rule) classifiers, such as C4.5, CBA, CMAR, and CAEP, are
heuristic and lack a strong mathematical background, while DRC-BK mines
knowledge from the hyper-plane constructed by SVM, which means that DRC-BK is
based on the structural risk minimization theory.



2 Classifier 

When applied with a Boolean kernel, SVM can construct an optimal hyper-plane
for classification by learning Boolean functions in the high dimensional
feature space [1].

Proposition 1. Suppose $U \in \{0,1\}^n$, $V \in \{0,1\}^n$, $\sigma > 0$. Then
$K(U,V) = -1 + \prod_{i=1}^{n} (\sigma U_i V_i + 1)$ is a Boolean kernel.

Please refer to [1] for the proof and for the model selection of $K$. In our
experiment, we simply set $C$ to 0.5, set $\sigma$ to 0.01, 0.1 or 0.5, and
choose the best $\sigma$ for mining classification rules.

After training, the classification function learned by SVM is
$f(X) = \mathrm{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i K(X_i, X) + b\right)$.
Here, $\alpha_i$ ($i = 1, \ldots, l$, where $l$ is the count of support
vectors) and $b$ are the knowledge learned by SVM, which is not understandable
by human experts. Here, we present our DRC-BK classifier, which uses SVM as
its learning engine and mines classification rules from the knowledge learned
by the SVM. The detailed steps for DRC-BK are as follows:

1, Construct a hyper-plane for classification by SVM, which is applied with
   the Boolean kernel $K$.
2, Mine classification rules from this hyper-plane.
3, Classify the testing data with the rules learned in step 2.

Suppose $g(X) = \sum_{i=1}^{l} \alpha_i y_i K(X_i, X) + b$; then, we have

\[
g(X) = \sum_{i=1}^{l} \alpha_i y_i
       \left( -1 + \prod_{j=1}^{n} (\sigma X_{i,j} X_j + 1) \right) + b.
\]

If we look at each dimension of the input space as a Boolean literal, then each
dimension of the feature space can be looked at as a conjunction of several
Boolean literals in the input space. Hence, $g(X)$ can be looked at as a
weighted linear sum of all these conjunctions. If the input space has $n$
dimensions, then the feature space has $2^n - 1$ dimensions. However, these
dimensions do not make the same contribution to classification. Most of them
can be ignored because their contribution is too small.
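The link between the kernel and the conjunctions can be illustrated
numerically. The sketch below assumes the kernel has the product form
-1 + prod(sigma * U_i * V_i + 1) and that a conjunction of j literals
contributes sigma to the power j whenever it holds in both vectors; the
function names and the example vectors are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def boolean_kernel(u, v, sigma):
    """Product form of the kernel for two 0/1 vectors of equal length."""
    return -1 + prod(sigma * ui * vi + 1 for ui, vi in zip(u, v))


def expanded_kernel(u, v, sigma):
    """Same value, computed as a sum over all nonempty conjunctions."""
    n = len(u)
    total = 0
    for j in range(1, n + 1):
        for literals in combinations(range(n), j):
            holds_in_u = all(u[i] for i in literals)
            holds_in_v = all(v[i] for i in literals)
            if holds_in_u and holds_in_v:
                total += sigma ** j
    return total


u = (1, 0, 1, 1)
v = (1, 1, 1, 0)
sigma = Fraction(1, 2)
product_value = boolean_kernel(u, v, sigma)
expanded_value = expanded_kernel(u, v, sigma)
# Number of conjunctions, i.e. feature-space dimensions, for n = 4 literals.
dimensions = sum(1 for j in range(1, 5) for _ in combinations(range(4), j))
```

The explicit sum visits every one of the 2^n - 1 conjunctions, which is
exactly the cost the kernel form avoids.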

Let a non-linear mapping $\phi$ map $X$, a vector in the input space, to $Z$, a
vector in the feature space; then $g(X)$ can be written as
$g(X) = \sum_{i=1}^{2^n - 1} W_{Z_i} Z_i + b$. Here, $W_{Z_i}$ is the weight of
dimension $Z_i$. Then, the contribution to classification made by $Z_i$ can be
measured as $\partial g / \partial Z_i = W_{Z_i}$. This means that the bigger
$|W_{Z_i}|$ is, the more contribution to classification $Z_i$ can make; and the
smaller $|W_{Z_i}|$ is, the less contribution to classification $Z_i$ can make.
This conclusion can be used for feature selection for linear SVM.

For a conjunction $z$ of $j$ Boolean literals,
$z = x_{s_1} x_{s_2} \cdots x_{s_j}$, its weight can be calculated as
$w_z = \sigma^j \sum_{i=1}^{l} \alpha_i y_i \prod_{k=1}^{j} X_{i,s_k}$. We make
the following definitions.

Definition 1. Classification Rule: For a classification rule
$r = \langle z, w_z \rangle$, $z$ is a conjunction of $j$ Boolean literals
($j > 0$), and $w_z$ is the weight of this rule calculated by the above
formula.

Definition 2. $j$-length Rule: For a rule $r = \langle z, w_z \rangle$, if $z$
is a Boolean function made up of $j$ Boolean literals, then we say that $r$ is
a $j$-length rule (or, a rule with length $j$).

In the knowledge learned by SVM, positive support vectors are the sample data
which satisfy $y_i = +1$, and negative support vectors are the sample data
which satisfy $y_i = -1$. Here, we write SVP and SVN for the set of positive
support vectors and the set of negative support vectors, respectively. Then,
$w_{z,SVP} = \sigma^j \sum_{i \in SVP} \alpha_i \prod_{k=1}^{j} X_{i,s_k}$ and
$w_{z,SVN} = \sigma^j \sum_{i \in SVN} \alpha_i \prod_{k=1}^{j} X_{i,s_k}$
represent the ability of the Boolean function
$z = x_{s_1} x_{s_2} \cdots x_{s_j}$ to distinguish positive sample data and
negative sample data, respectively.

Definition 3. Interesting Rule: An interesting rule $\langle z, w_z \rangle$ is
a rule which satisfies $w_{z,SVP} \ge minWeight$ or $w_{z,SVN} \ge minWeight$.
Here, minWeight is a user-defined parameter, representing the minimal weight.

Please note that $\alpha_i > 0$ for $i = 1, 2, \ldots, l$ and
$w_z = w_{z,SVP} - w_{z,SVN}$, so we have
$|w_z| \le \max(w_{z,SVP}, w_{z,SVN})$. Therefore, if $w_{z,SVP} < minWeight$
and $w_{z,SVN} < minWeight$, then $|w_z| < minWeight$.

Let us consider the conjunctions $z = x_{s_1} x_{s_2} \cdots x_{s_j}$ and
$z' = x_{s_1} x_{s_2} \cdots x_{s_j} x_{s_{j+1}}$. The length of $z$ is $j$,
the length of $z'$ is $j+1$, and $z$ is contained in $z'$. The ability of $z'$
to distinguish positive sample data and negative sample data is
$w_{z',SVP} = \sigma^{j+1} \sum_{i \in SVP} \alpha_i \prod_{k=1}^{j+1} X_{i,s_k}$
and
$w_{z',SVN} = \sigma^{j+1} \sum_{i \in SVN} \alpha_i \prod_{k=1}^{j+1} X_{i,s_k}$,
respectively. As we have discussed in section 3.3, the value of $\sigma$
satisfies $\sigma < 1$. So, we have $w_{z',SVP} \le w_{z,SVP}$ and
$w_{z',SVN} \le w_{z,SVN}$. This means that an arbitrary conjunction $z'$ which
contains the conjunction $z = x_{s_1} x_{s_2} \cdots x_{s_j}$ does not have a
stronger ability to distinguish positive (negative) sample data than $z$ does.

Please refer to Figure 1 for the algorithm for mining interesting
classification rules. In this figure, the function $BF(R)$ is used to
calculate the set of conjunctions $\{z\}$ from the rules in the rule set
$R = \{\langle z, w_z \rangle\}$. Figure 2 gives the algorithm for mining
classification rules. In our experiment, we simply set the parameter minWeight
to $b \times 0.05$. Here, $b$ is the parameter learned by SVM. Figure 3 gives
the classification algorithm for DRC-BK, in which the parameter $b$ is again
the knowledge learned by SVM.



Algorithm 1: Mining interesting rules from positive (negative) support vectors.
Input: support vector set SV (could be SVP or SVN), weight vector $\alpha$,
       minimal weight minWeight
Output: interesting rule set $R_{SV}$

1, $R_1 = \{\langle z, w_{z,SV} \rangle \mid w_{z,SV} \ge minWeight\}$
2, $R_{SV} = \emptyset$, $n = 2$
3, $R_n = \{\langle z z', w_{zz',SV} \rangle \mid z \in BF(R_{n-1}) \wedge z' \in BF(R_1) \wedge w_{zz',SV} \ge minWeight\}$
5, if $R_n = \emptyset$ goto 6, else $n = n + 1$, goto 3
6, Output $R_{SV}$

Fig. 1. Algorithm for Mining Interesting Rules

Algorithm 2: Mining classification rules.
Input: support vector sets SVP and SVN, weight vector $\alpha$,
       minimal weight minWeight
Output: classification rule set $R$

1, Following Algorithm 1, mine the interesting rule set $R_P$ from SVP
2, Following Algorithm 1, mine the interesting rule set $R_N$ from SVN
3, $R = \{\langle r, w_r \rangle \mid r \in BF(R_P \cup R_N)\}$
4, Output $R$

Fig. 2. Algorithm for Mining Classification Rules
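The two mining algorithms can be sketched as a level-wise search. This is a
hedged reading rather than the authors' implementation: it assumes that the
weight of a conjunction over a support vector set is sigma to the power of its
length times the sum of the alpha values of the support vectors it matches,
that sigma is at most 1 so that extending a conjunction never increases this
weight, and that the final rule weight is the positive weight minus the
negative weight. Conjunctions are represented as sorted tuples of literal
indices; all names and data are hypothetical.

```python
from fractions import Fraction


def sv_weight(z, svs, alphas, sigma):
    """sigma^len(z) times the total alpha of support vectors matched by z."""
    matched = sum(a for x, a in zip(svs, alphas) if all(x[k] for k in z))
    return sigma ** len(z) * matched


def mine_interesting(svs, alphas, sigma, min_weight):
    """Level-wise search for conjunctions whose weight reaches min_weight."""
    n = len(svs[0])
    level = [(k,) for k in range(n)
             if sv_weight((k,), svs, alphas, sigma) >= min_weight]
    singles = [z[0] for z in level]
    found = []
    while level:
        found.extend(level)
        # Extend each surviving conjunction by one interesting literal.
        candidates = {tuple(sorted(set(z) | {k}))
                      for z in level for k in singles if k not in z}
        level = [z for z in sorted(candidates)
                 if sv_weight(z, svs, alphas, sigma) >= min_weight]
    return found


def mine_rules(svp, alpha_p, svn, alpha_n, sigma, min_weight):
    """Union of the interesting conjunctions, each with its signed weight."""
    conjunctions = (set(mine_interesting(svp, alpha_p, sigma, min_weight))
                    | set(mine_interesting(svn, alpha_n, sigma, min_weight)))
    return {z: sv_weight(z, svp, alpha_p, sigma) - sv_weight(z, svn, alpha_n, sigma)
            for z in conjunctions}


svp, alpha_p = [(1, 1, 0), (1, 0, 0)], [Fraction(2), Fraction(1)]
svn, alpha_n = [(0, 0, 1)], [Fraction(3)]
rules = mine_rules(svp, alpha_p, svn, alpha_n, Fraction(1, 2), Fraction(1, 2))
```

Because a longer conjunction cannot weigh more than the conjunction it
extends, the search can stop extending any candidate that falls below the
threshold, in the same way as frequent itemset mining.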



Algorithm 3: Classification Algorithm
Input: rule set R, testing data sample, parameter b
Output: the class type of testing data sample

1, f = b
2, for each rule in R
       if rule.z matches sample then f = f + rule.w
3, Output sgn(f)

Fig. 3. Classification Algorithm
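Algorithm 3 translates almost directly into code. In the sketch below a rule
set is a dictionary that maps a conjunction, written as a tuple of literal
indices, to its weight; this representation, the example rules, the value of b
and the convention that a score of exactly zero is classified as positive are
all assumptions made for illustration.

```python
from fractions import Fraction


def classify(rules, sample, b):
    """Add up the weights of all matching rules, starting from b."""
    f = b
    for z, w in rules.items():
        # A conjunction matches when all of its literals are 1 in the sample.
        if all(sample[k] for k in z):
            f += w
    return 1 if f >= 0 else -1


# Hypothetical rule set and bias.
rules = {
    (0,): Fraction(3, 2),
    (1,): Fraction(1),
    (0, 1): Fraction(1, 2),
    (2,): Fraction(-3, 2),
}
b = Fraction(-1, 4)
positive_label = classify(rules, (1, 1, 0), b)
negative_label = classify(rules, (0, 0, 1), b)
```

Only the rules whose conjunction matches the sample contribute, so the
decision for a sample can be explained by listing those rules and their
weights.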



Table 1. The comparison of classification accuracy of DRC-BK and 9 other classifiers.

NAME      #INS  #ATTR  #BOOL  C4.5  CBA   CMAR   DEep   CAEP   SVM    DRC-BK  #RULE  TIME
AUSTRA    690   14     50     84.7  84.9  86.1   84.78  86.21  84.49  85.36   1667   3.461
DIABETES  768   8      15     74.2  74.5  75.8   76.82  X      77.73  79.04   1072   1.032
GERMAN    1000  200    60     72.3  73.4  74.9   74.4   72.5   74.9   75.2    93     2.284
HEART     270   13     18     80.8  81.9  82.2   81.11  83.7   81.48  83.33   815    0.811
IONO      351   34     143    90    92.3  91.5   86.23  90.04  90.03  90.6    115    0.252
PIMA      768   8      15     75.5  72.9  75.1   76.82  75     77.21  77.86   96     0.148
SONAR     208   60     42     70.2  77.5  79.4   84.16  X      87.02  85.58   591    0.758
TIC-TAC   958   9      27     99.4  99.6  99.2   99.06  99.06  98.33  99.79   9623   42.63
BREAST    699   10     30     95    96.3  96.4   96.42  97.28  96.42  96.85   321    0.332
CLEVE     303   13     29     78.2  82.8  82.2   87.17  83.25  83.17  83.17   1495   3.189
CRX       690   15     61     84.9  84.7  84.9   84.18  X      86.24  85.51   993    1.852
HEPATIC   155   19     46     80.6  81.8  80.5   81.18  83.03  85.81  86.45   293    0.334
HORSE     368   22     78     82.6  82.1  82.6   84.21  X      82.34  84.51   956    1.87
HYPO      3163  25     57     99.2  98.9  98.4   97.19  X      99.34  97.5    289    0.534
LABOR     57    16     41     79.3  86.3  89.7   87.67  X      92.98  92.98   1984   2.545
SICK      2800  29     63     98.5  97    97.5   94.03  X      97.29  97.32   1740   4.548
AVERAGE   828   30.9   48.4   84.1  85.4  86.03  85.96  85.56  87.17  87.57   1384   4.161



3 Experiment Results 



In order to compare the classification accuracy of DRC-BK with that of other classifiers, 
we ran experiments on 16 binary datasets from the UCI repository. We used 10-fold cross 
validation and report the average classification accuracy. Table 1 gives the experimental 
results. In Table 1, column 1 lists the names of the 16 datasets used in our experiments. 
Columns 2, 3, and 4 give the number of samples, the number of attributes in the original 
dataset, and the number of attributes after pre-processing, respectively. Columns 5, 6, 
and 7 give the classification accuracy of C4.5, CBA, and CMAR, respectively. These 
experimental results are copied from [2]. Columns 8, 9, and 10 give the classification 
accuracy of DEep, CAEP, and linear SVM, respectively. Columns 11, 12, and 13 give the 
classification accuracy, the number of rules, and the execution time for DRC-BK to mine 
classification rules from the hyper-plane. From Table 1, we can see that DRC-BK has the 
best average classification accuracy among the 7 classifiers. 



4 Conclusion and Future Work 

In this paper, we present a novel decision rule classifier, DRC-BK, which has high 
classification accuracy and makes it possible to study the generalization error of 
decision rule classifiers quantitatively in the future. In order to refine the rule set, 
our future research will look for a better Boolean kernel for mining classification rules. 



A New Data Mining Method Using Organizational 
Coevolutionary Mechanism 



Jing Liu, Weicai Zhong, Fang Liu, and Licheng Jiao 

Abstract. The organizational coevolutionary algorithm for classification (OCEC) 
is designed with the intrinsic properties of data mining in mind. OCEC evolves 
groups of examples, and rules are then extracted from these groups of 
examples at the end of evolution. OCEC is first compared with G-NET and 
JoinGA. All results show that OCEC achieves a higher predictive accuracy. 
Then, the scalability of OCEC is studied. The results show that the 
classification time of OCEC increases linearly. 



1 Introduction 

A new classification algorithm, the organizational coevolutionary algorithm for 
classification (OCEC), is designed to deal with the classification task in data mining. 
The new approach adopts the coevolutionary model of multiple populations, focusing on 
extracting rules from examples. The main difference between it and the existing 
EA-based classification algorithms is its use of a bottom-up search mechanism. This 
approach evolves groups of examples, and rules are then extracted from these 
groups of examples at the end of evolution. 



2 An Organizational Coevolutionary Algorithm for Classification 

In order to avoid confusion about terminology, some concepts are explained. 
DEFINITION 1. Let $\mathcal{A}_i$ be a set of attribute values. An instance space $I$ is 
the Cartesian product of sets of attribute values, 
$I = \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$. An attribute 
$A_i : I \to \mathcal{A}_i$ is a projection function from the instance space to a set of 
attribute values. An instance $i$ is an element of $I$, $E \subset I \times C$ is a set of 
examples, and an example $e$ is an element of $I \times C$, where $C$ is the class name set. 

DEFINITION 2. An Organization, org, is a set of examples belonging to the same 
class and the intersection of different organizations is empty. The examples in an or- 
ganization are called Members. 

DEFINITION 3. If all members of org have the same value for attribute A, then A is 
a Fixed-value Attribute; if A is a fixed-value attribute and satisfies the conditions 
required for rule extraction, then A is a Useful Attribute. These conditions 
will be explained later, in ALGORITHM 1. The set of fixed-value attributes of org is 
labeled as $F_{org}$, and that of useful attributes is labeled as $U_{org}$. 

Organizations are also divided into three classes: 

Trivial Organization: An organization whose number of members is 1; all 
attributes of such an organization are useful ones; 

Abnormal Organization: An organization whose set of useful attributes is empty; 

Normal Organization: An organization that does not belong to the two classes above. 

The sets of the three kinds of organizations are labeled as $ORG_T$, $ORG_A$, and $ORG_N$. 

Given that two parent organizations, $org_{p1}$ and $org_{p2}$, are randomly selected from 
the same population: 

Migrating operator: n members randomly selected from $org_{p1}$ are moved to $org_{p2}$, 
with two child organizations, $org_{c1}$ and $org_{c2}$, obtained. 

Exchanging operator: n members randomly selected from each parent organization 
are exchanged, with two child organizations, $org_{c1}$ and $org_{c2}$, obtained. Here 
$1 \le n < \min\{|org_{p1}|, |org_{p2}|\}$, where $|org|$ denotes the number of members in org. 

Merging operator: The members of the two organizations are merged, with one 
child organization, $org_{c1}$, obtained. 
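
The three evolutionary operators can be sketched as follows. This is a minimal illustration, not the authors' implementation: an organization is modelled as a plain list of examples, and the function names `migrate`, `exchange`, and `merge` are hypothetical. The caller is assumed to pass an n that respects the size bound stated above.

```python
import random
from typing import List, Tuple

Org = List[tuple]  # an organization is modelled as a list of examples


def migrate(p1: Org, p2: Org, n: int, rng: random.Random) -> Tuple[Org, Org]:
    """Move n randomly chosen members of p1 into p2."""
    moved = set(rng.sample(range(len(p1)), n))
    c1 = [e for i, e in enumerate(p1) if i not in moved]
    c2 = p2 + [p1[i] for i in sorted(moved)]
    return c1, c2


def exchange(p1: Org, p2: Org, n: int, rng: random.Random) -> Tuple[Org, Org]:
    """Swap n randomly chosen members of p1 with n randomly chosen members of p2."""
    out1 = set(rng.sample(range(len(p1)), n))
    out2 = set(rng.sample(range(len(p2)), n))
    c1 = [e for i, e in enumerate(p1) if i not in out1] + [p2[i] for i in sorted(out2)]
    c2 = [e for i, e in enumerate(p2) if i not in out2] + [p1[i] for i in sorted(out1)]
    return c1, c2


def merge(p1: Org, p2: Org) -> Org:
    """Merge the members of both parents into a single child organization."""
    return p1 + p2
```

All three operators only redistribute members, so the multiset of examples in the population is preserved; only the grouping changes.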

Organizational selection mechanism: After an operator creates a pair of new or- 
ganizations, a tournament will be held between the new pair and the parent pair. The 
pair containing the organization with the highest fitness survives to the next genera- 
tion, while the other pair is deleted. If child organizations survive to the next genera- 
tion, and one of them is an abnormal organization, then it is dismissed and its mem- 
bers are added to the next generation as trivial organizations. If only one organization 
remains in a population, it will be passed to the next generation directly. 

A measure, Attribute Significance, is introduced to determine the significance of 
an attribute. The significance of attribute $A_i$ is labeled as $S_{A_i}$. All populations 
use the same $S_{A_i}$. The value of $S_{A_i}$ will be updated each time the fitness of an 
organization is computed, and the method is shown in ALGORITHM 1. 

ALGORITHM 1. Attribute significance 

$t$ denotes the generation of evolution, the number of attributes is $m$, $N$ is a 
predefined parameter, and $org$ is the organization under consideration, $org \notin ORG_T$. 

Step1: If $t = 0$, then $S_{A_i} \leftarrow 1.0$, $i = 1, 2, \ldots, m$; $U_{org} \leftarrow \emptyset$; 

Step2: For each $A_i \in F_{org}$, randomly select a population without $org$, and an 
organization $org_1$ from it. If $A_i \in F_{org_1}$, and the attribute value of $A_i$ in 
$org_1$ is different from that of $org$, then $U_{org} \leftarrow U_{org} \cup \{A_i\}$; 
otherwise reduce $S_{A_i}$ according to (1) (Case 1); 

Step3: If $U_{org} = \emptyset$, stop; otherwise randomly select $N$ examples from the 
classes without $org$. If the combination of the attribute values in $U_{org}$ does not 
appear in the $N$ examples, then increase the attribute significance of all the attributes 
in $U_{org}$ according to (1) (Case 2); otherwise, $U_{org} \leftarrow \emptyset$. 

$$S_{A_i} \leftarrow \begin{cases} 0.9\,S_{A_i} + 0.05 & \text{Case 1} \\ 0.9\,S_{A_i} + 0.2 & \text{Case 2} \end{cases} \qquad (1)$$

The value of $S_{A_i}$ is restricted to the range of [0.5, 2]. The conditions of Case 1 and 
Case 2 are the ones required for rule extraction in DEFINITION 3. 

The fitness function is defined in (2), where $A_i$ denotes the $i$th attribute in $U_{org}$. 



$$Fitness(org) = \begin{cases} 0 & org \in ORG_T \\ -1 & org \in ORG_A \\ |org| \prod_{i=1}^{|U_{org}|} S_{A_i} & org \in ORG_N \end{cases} \qquad (2)$$
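
The attribute significance update (1) and the fitness function (2) can be sketched together. The sketch rests on two assumptions about the printed formulas: that (1) reads S <- 0.9 S + 0.05 in Case 1 and S <- 0.9 S + 0.2 in Case 2 (whose fixed points, 0.5 and 2, agree with the stated range [0.5, 2]), and that the fitness of a normal organization is its size multiplied by the product of the significances of its useful attributes. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def update_significance(s: float, case: int) -> float:
    """One update of (1), assumed to be 0.9*s + 0.05 (Case 1) or 0.9*s + 0.2 (Case 2)."""
    if case == 1:
        s = 0.9 * s + 0.05  # repeated Case 1 updates approach 0.5
    elif case == 2:
        s = 0.9 * s + 0.2   # repeated Case 2 updates approach 2
    else:
        raise ValueError("case must be 1 or 2")
    return min(2.0, max(0.5, s))  # keep the value inside [0.5, 2]


def fitness(n_members: int, useful: Sequence[str], significance: Dict[str, float]) -> float:
    """Fitness (2): 0 for trivial, -1 for abnormal, size times product otherwise."""
    if n_members == 1:
        return 0.0   # trivial organization
    if not useful:
        return -1.0  # abnormal organization: no useful attributes
    return n_members * math.prod(significance[a] for a in useful)
```

For example, an organization with 4 members and two useful attributes of significance 1.5 and 0.5 gets fitness 4 x 1.5 x 0.5 = 3.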

The whole algorithm of OCEC is given in ALGORITHM 2. 

ALGORITHM 2. Organizational coevolutionary algorithm for classification 

Step1: For each example, if its class name is $c_i$, $1 \le i \le m$, then add it to 
population $P_i$ as a trivial organization; $t \leftarrow 0$, $j \leftarrow 1$; 

Step2: If $j > m$, go to Step7; 

Step3: If the number of organizations in $P_j$ is greater than 1, then go to Step4; 
otherwise go to Step6; 

Step4: Randomly select two parent organizations, $org_{p1}$ and $org_{p2}$, from $P_j$, and 
then randomly select an operator from the three evolutionary operators to act on $org_{p1}$ 
and $org_{p2}$; update the attribute significance according to ALGORITHM 1 and compute 
the fitness of the child organizations, $org_{c1}$ and $org_{c2}$; 

Step5: Perform the selection mechanism on $org_{p1}$, $org_{p2}$ and $org_{c1}$, $org_{c2}$, 
delete $org_{p1}$, $org_{p2}$ from $P_j$, and go to Step3; 

Step6: Move the organization left in $P_j$ to the next generation, and go to Step2; 

Step7: If stop conditions are reached, stop; otherwise, go to Step2. 

When the evolutionary process is over, rules will be extracted from organizations. 
First, all organizations are merged as follows: Merge any two organizations in an 
identical population into a new organization if one set of useful attributes is a subset 
of the other set. The members of the new organization are those of the two organiza- 
tions, and its set of useful attributes is the intersection of the two original sets. Next, a 
rule will be extracted from each organization based on its set of useful attributes. Each 
useful attribute forms a condition, and the conclusion is the class of the organization. 
Then, the ratio of positive examples a rule covers to all examples in the class the rule 
belongs to is calculated for each rule. Based on the ratio, all rules are ranked. Finally, 
some rules will be deleted: If the set of examples covered by a rule is a subset of the 
union of examples covered by some rules before this rule, the rule will be deleted. 
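
The final ranking and deletion steps of rule extraction can be illustrated with a short sketch. It is simplified under stated assumptions: all rules belong to one class of known size, each rule is represented only by a name and the set of positive example ids it covers, and the name `rank_and_prune` is hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[str, FrozenSet[int]]  # (rule name, ids of the positive examples it covers)


def rank_and_prune(rules: List[Rule], class_size: int) -> List[Rule]:
    """Rank rules by coverage ratio, then delete rules already covered by earlier ones."""
    ranked = sorted(rules, key=lambda r: len(r[1]) / class_size, reverse=True)
    kept: List[Rule] = []
    covered: Set[int] = set()
    for name, examples in ranked:
        if examples <= covered:  # subset of the union of the rules ranked before it
            continue
        kept.append((name, examples))
        covered |= examples
    return kept
```

Because a deleted rule never covers anything outside the union of the rules before it, tracking only the kept rules gives the same union as tracking all earlier rules.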



3 Experimental Evaluation 

11 datasets in the UCI repository* are used to test the performance of OCEC. The 
parameters of OCEC are: the number of generations is 500 for the datasets whose 
number of examples is less than 1000, and 1000 for the ones whose number of examples is 
larger than 1000. N is set to 10 percent of the number of examples for each dataset, 
and n is set to 1. Table 1 shows the comparison between OCEC and G-NET [1]. As 
can be seen, the predictive accuracies of OCEC on 7 of the 8 datasets are equivalent 
to or higher than those of G-NET. Table 2 shows the comparison between OCEC and 
JoinGA [2]. As can be seen, the predictive accuracies of OCEC are equivalent to or 
higher than those of JoinGA on all the 4 datasets. 

* http://www.ics.uci.edu/~mlearn/MLRepository.html 



Table 1. Comparison between OCEC and G-NET 

        Monk1        Monk2       Monk3        Tic-tac-toe  Credit       Breast cancer (W)  Vote        Mushrooms
G-NET   100.00±0.00  97.20±3.80  100.00±0.00  99.03±0.62   [illegible]  94.71±2.89         94.90±3.20  100.00
OCEC    100.00±0.00  73.18±7.31  100.00±0.00  100.00±0.00  87.97±4.38   96.13±2.03         95.87±2.61  100.00±0.00


Table 2. Comparison between OCEC and JoinGA 

         Australian  Lymphography  Chess (KR-vs-KP)  Mushrooms
JoinGA   84.9±3.7    82.4±6.3      99.4              100.0
OCEC     87.97±4.04  86.38±8.92    99.51±0.09        100.00±0.00




Fig. 1. (a) The scalability of OCEC on the number of training examples, (b) the scalability of 
OCEC on the number of attributes 



4 The Scalability of OCEC 

The evaluation methodology and synthetic datasets proposed in [3] are used. The pa- 
rameters of OCEC are: the number of generations is 5000 for all datasets, and n is se- 
lected from 1~5 randomly. In order to test the predictive accuracy of the obtained 
rules, another 10,000 instances are generated for each function as the test set. 

Fig. 1(a) shows the performance of OCEC as the number of training examples is 
increased from 100,000 to 10 million in steps of 1,100,000. The results show that the 
classification time of OCEC grows linearly. Even when the number of training examples is 
increased to 10 million, the classification time is still shorter than 3500s. 

Fig. 1(b) shows the performance of OCEC as the number of attributes is increased 
from 9 to 400 in steps of 39, where the number of training examples is 100,000. The 
results show that the classification time again grows linearly. When the number of 
attributes is increased to 400, the classification time is still shorter than 1400s. 



Abstract. Classification is an important data mining problem. A 
desirable property of a classifier is noise tolerance. Emerging Patterns 
(EPs) are itemsets whose supports change significantly from one 
data class to another. In this paper, we first introduce Chi Emerging 
Patterns (Chi EPs), which are more resistant to noise than other 
kinds of EPs. We then use Chi EPs in a probabilistic approach for 
classification. The classifier, Bayesian Classification by Chi Emerging 
Patterns (BCCEP), can handle noise very well due to the inherent noise 
tolerance of the Bayesian approach and high quality patterns used in 
the probability approximation. The empirical study shows that our 
method is superior to other well-known classification methods such as 
NB, C4.5, SVM and JEP-C in terms of overall predictive accuracy, on 
“noisy” as well as “clean” benchmark datasets from the UCI Machine 
Learning Repository. Out of the 116 cases, BCCEP wins on 70 cases, 
NB wins on 30, C4.5 wins on 33, SVM wins on 32 and JEP-C wins on 21. 

Keywords: Emerging patterns, noise tolerance, classification, Bayesian 
learning, noise 



1 Introduction 

Classification is an important data mining problem, and has also been studied 
substantially in statistics, machine learning, neural networks and expert systems 
over decades [1]. Data mining is typically concerned with observational retrospec- 
tive data, i.e., data that has already been collected for some other purpose. For 
many reasons such as encoding errors, measurement errors, unrecorded causes 
of recorded features, the information in a database is almost always noisy. The 
problem of dealing with noisy data is one of the most important research and 
application challenges in the field of Knowledge Discovery in Databases (KDD). 
There are three kinds of noise in the training data: attribute noise (wrong at- 
tribute values), label noise, also called classification noise (wrong class labels), 
and mix noise (both classification and attribute noise) . The noise in the training 
data can mislead a learning algorithm to fit it into the classification model. As 
a result, the classifier finds many meaningless “regularities” in the data. The 
phenomenon is often referred to as overfitting. In this paper, we address the 
following question: how can a classifier cope with noisy training data? Our main 
contributions are the identification of the most discriminating knowledge, Chi 
Emerging Patterns (Chi EPs), and a probabilistic approach that uses Chi EPs 
for classification to resist noise. 



2 Chi Emerging Patterns and Related Work 

Emerging Patterns [2] are conjunctions of simple conditions, where each conjunct 
is a test of the value of one of the attributes. EPs are defined as multivariate 
features (i.e., itemsets) whose supports (or frequencies) change significantly from 
one class to another. A JEP is a special type of EP, defined as an itemset whose 
support increases abruptly from zero in one dataset, to non-zero in another 
dataset - the ratio of support-increase being infinite. Since EPs/JEPs capture 
the knowledge of sharp differences between data classes, they are very suitable 
for serving as a classification model. By aggregating the differentiating power of 
EPs/JEPs, the constructed classification systems [3,4] are usually more accurate 
than other existing state-of-the-art classifiers. 

Recently, the noise tolerance of EP-based classifiers such as CAEP and JEPC 
has been studied using a number of datasets from the UCI Machine Learning 
Repository where noise is purposely introduced to the original datasets [5]. The 
results show that both CAEP and JEPC do not experience overfitting due to 
the aggregating approach used in the classification and they are generally noise 
tolerant. Their comparison of the learning curves of a number of classifiers shows 
that JEPC and NB are the most noise tolerant, followed by C5.0, CAEP and 
kNN. However, it is difficult to apply JEPC to noisy datasets. On one 
hand, by definition, an itemset which occurs once (or very few times) in one 
data class while zero times in another class is a JEP. Such JEPs are usually 
regarded as noise and are not useful for classification. The number 
of those useless JEPs can be very large due to the injection of noise, which not 
only makes the mining of JEPs difficult, but also makes JEPC very 
inefficient or even unusable (although the number of the most expressive JEPs 
is usually small, we have observed that this number becomes exponential for 
some noisy datasets). On the other hand, by definition, an itemset with large 
but finite support growth rate is not a JEP. The number of JEPs for some noisy 
datasets can be very small or even zero, because of the strict requirement that 
JEPs must have zero support in one class. Large-growth-rate EPs are also good 
discriminators to distinguish two classes. The exclusive use of JEPs makes 
JEPC very unreliable when there are few JEPs to make a decision. 

To overcome the problems of JEPC, we propose a new kind of Emerging 
Patterns, called Chi Emerging Patterns (χEPs). 

Definition 1. An itemset X is called a Chi emerging pattern (Chi EP), if all 
the following conditions are true: 

1. $supp(X) \ge \xi$, where $\xi$ is a minimum support threshold; 

2. $GR(X) \ge \rho$, where $\rho$ is a minimum growth rate threshold; 

3. $\neg\exists Y\,\big((Y \subset X) \wedge (supp(Y) \ge \xi) \wedge (GR(Y) \ge \rho) \wedge (strength(Y) \ge strength(X))\big)$; 

4. $|X| = 1 \vee \big(|X| > 1 \wedge (\forall Y\,(Y \subset X \wedge |Y| = |X| - 1) \Rightarrow chi(X, Y) \ge \eta)\big)$, 
where $\eta = 3.84$ is a minimum chi-value threshold and $chi(X, Y)$ is computed using 
the following contingency table [6]: 

           X                Y                Σ row
D1         count_D1(X)      count_D1(Y)      count_D1(X) + count_D1(Y)
D2         count_D2(X)      count_D2(Y)      count_D2(X) + count_D2(Y)
Σ column   count_D1+D2(X)   count_D1+D2(Y)   count_D1+D2(X) + count_D1+D2(Y)



Chi EPs are high quality patterns for classification and are believed to resist 
noise better for the following reasons. The first condition ensures that a χEP 
is not noise by imposing a minimum coverage on the training dataset. The 
second requires that a χEP has reasonably strong discriminating power, because 
larger growth rate thresholds produce very few EPs and smaller thresholds 
generate EPs with less discriminating power. The third prefers short EPs 
with large strength: subsets of χEPs may satisfy conditions (1) and (2), but they 
will not have strong discriminating power; supersets of χEPs are not regarded 
as essential because usually the simplest hypothesis consistent with the data is 
preferred. Generally speaking, the last condition states that an itemset is a χEP 
if the distribution (namely, the supports in two contrasting classes) of its subset 
is significantly different from that of the χEP itself, where the difference is 
measured by the χ²-test [6]. In other words, every item in the itemset contributes 
significantly to the discriminating power of the χEP. Chi EPs can be efficiently 
discovered by the tree-based pattern fragment growth methods [7]. 
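
The chi-value test in the fourth condition can be illustrated with a plain Pearson chi-square computation on the 2x2 contingency table of counts. This is a sketch under assumptions: the function name `chi_square_2x2` is hypothetical, no continuity correction is applied, and a table with an empty row or column is treated as showing no difference. The threshold 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom.

```python
CHI_THRESHOLD = 3.84  # 5% critical value for one degree of freedom


def chi_square_2x2(x_d1: int, y_d1: int, x_d2: int, y_d2: int) -> float:
    """Pearson chi-square statistic for the counts of X and Y in D1 and D2."""
    row1, row2 = x_d1 + y_d1, x_d2 + y_d2
    col_x, col_y = x_d1 + x_d2, y_d1 + y_d2
    total = row1 + row2
    if 0 in (row1, row2, col_x, col_y):
        return 0.0  # degenerate table: treated here as no evidence of a difference
    chi = 0.0
    cells = ((x_d1, row1, col_x), (y_d1, row1, col_y),
             (x_d2, row2, col_x), (y_d2, row2, col_y))
    for observed, row, col in cells:
        expected = row * col / total
        chi += (observed - expected) ** 2 / expected
    return chi
```

With counts 30 and 10 in D1 and 10 and 30 in D2, every expected cell is 20 and the statistic is 20, well above the threshold; with identical counts in all four cells the statistic is 0.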

3 Bayesian Classification by Chi Emerging Patterns 

The Naive Bayes (NB) classifier [8] has been shown inherently noise tolerant 
due to its collection of class and conditional probabilities. The main weakness 
of NB is the assumption that all attributes are independent given the class. Our 
previous work [9] shows that extending NB by using EPs can relax the strong 
attribute independence assumption. In this paper, we propose to use Chi EPs 
in the probabilistic approach for classification. The classifier is called Bayesian 
Classification by Chi Emerging Patterns (BCCEP). BCCEP can handle noise 
very well, because (1) it retains the noise tolerance of the Bayesian approach; 
(2) it uses high quality patterns (χEPs) of arbitrary size in the probability 
approximation, which overcomes NB’s weakness and provides a better scoring 
function than previous EP-based classifiers such as CAEP and JEPC. The details 
about the implementation of BCCEP can be found in [10]. 

4 Experimental Evaluation 

We carry out experiments on 29 datasets from the UCI Machine Learning 
Repository [11]. We regard the original datasets downloaded from UCI as “clean” 
datasets, although they are not guaranteed to be free from noise. For each 
dataset, we inject three kinds of noise at the level of 40% to generate three 
noisy datasets, namely the “attribute noise” dataset, the “label noise” dataset and 
the “mix noise” dataset. The details about the implementation of noise generation can be 
found in [10]. Note that when we evaluate the performance under noise, the test 
datasets do not contain injected noise; only the training datasets are affected by 
noise. We compare BCCEP against Naive Bayes, decision tree induction C4.5, 
Support Vector Machines (SVM), and JEP-C [4]. We use WEKA’s Java 
implementation of NB, C4.5 and SVM [12]. All experiments were conducted on a Dell 
PowerEdge 2500 (Dual P3 1GHz CPU, 2GB RAM) running Solaris 8/x86. The 
accuracy was obtained by using the methodology of stratified ten-fold 
cross-validation (CV-10). We use the Entropy method from the MLC++ machine 
learning library [13] to discretize datasets containing continuous attributes. 



Table 1. Performance Comparison 

           # Times as Top Classifier          Average Accuracy
           NB    C4.5  SVM  JEP-C  BCCEP      NB     C4.5   SVM     JEP-C    BCCEP
Clean      4     10    11   7      19         80.18  84.68  85.99   83.75    87.3
Attribute  9     7     5    11     21         76.94  78.42  76.71   79.98    82.29
Label      5     11    10   2      13         76.78  77.18  81.66   74.94    82.27
Mix        12    5     6    1      17         74.86  70.88  73.02   66.42    76.62
Total      30    33    32   21     70         -      -      -       -        -
Average    7.5   8.25  8    5.25   17.5       77.19  77.79  79.345  76.2725  82.12



A classifier is regarded as a top classifier when (1) it achieves the highest accuracy; or 
(2) its accuracy is very close to the highest one (the difference is less than 1%). 
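
The top-classifier criterion can be written as a small helper. The sketch assumes that accuracies are given in percent and that "less than 1%" means an absolute difference of less than one percentage point; the function name and the accuracy values in the usage note are hypothetical.

```python
from typing import Dict, List


def top_classifiers(accuracy: Dict[str, float], margin: float = 1.0) -> List[str]:
    """Return every classifier whose accuracy is within `margin` points of the best."""
    best = max(accuracy.values())
    return [name for name, acc in accuracy.items() if best - acc < margin]
```

For hypothetical accuracies NB 80.0, C4.5 84.5, SVM 85.2 and BCCEP 86.0, both SVM and BCCEP count as top classifiers. Since several classifiers can be top on the same dataset, the "Total" entries in Table 1 add up to more than the 116 cases.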



For lack of space, we only present a summary of results, shown in Table 1. 
More results can be found in [10]. We highlight some interesting points as follows: 

- The average accuracy of NB on clean, attribute noise and label noise datasets 
is lower than that of other classifiers. But NB deals with mix noise surprisingly 
well, when other classifiers are confused by the two-fold noise. 

- C4.5 is fast in all cases: clean, attribute noise, label noise or mix noise. There 
is a big drop in accuracy on mix noise datasets. 

- SVM is much slower than C4.5, especially when datasets are large. It deals 
with noise fairly well: its average accuracy is very close to that of BCCEP on label 
noise datasets. Unfortunately, we were unable to produce results for some 
large datasets such as Adult, Shuttle and Splice using the SVM implementation 
from WEKA [12], as the SVM program does not terminate or exceeds 
the memory limit. 

- JEP-C performs well on clean datasets. When attribute noise, label noise and 
mix noise are introduced, it becomes harder and harder to mine JEPs, generating 
either too many JEPs or too few JEPs, leading to decreased accuracy. 

- Our BCCEP deals with all three kinds of noise very well, as evidenced both 
by the highest accuracy and by the number of times it is a top classifier. 
Different classifiers are good at handling different kinds of noise. We believe that 
the success of BCCEP on all three kinds of noise is mainly due to its hybrid 
nature, combining Bayesian and EP-based classification. 

- The accuracy of BCCEP is almost always higher than that of its ancestor NB. The 
tree-based EP mining algorithm used in BCCEP mines a relatively small 
number of χEPs very fast on datasets where finding JEPs is very hard, i.e., 
where JEPC uses up 1GB of memory and gives up. 

5 Conclusions 

In this paper, we have shown that Chi Emerging Patterns resist noise better 
than other kinds of Emerging Patterns. The classifier, Bayesian Classification 
by Chi Emerging Patterns (BCCEP), combines the advantages of the Bayesian 
approach (inherent noise tolerance) and the EP-based approach (high quality 
patterns with sharp discriminating power). Our extensive experiments on 29 
benchmark datasets from the UCI Machine Learning Repository show that 
BCCEP has good overall predictive accuracy on “clean” datasets and three kinds 
of noisy datasets; it is superior to other state-of-the-art classification methods 
such as NB, C4.5, SVM and JEP-C: out of 116 cases (there are 29 datasets, 
each with four versions, namely clean, attribute noise, label noise and 
mix noise), BCCEP wins on 70, far more than any other classifier 
(for comparison, NB wins on 30, C4.5 on 33, SVM on 32 and JEP-C on 21). 



Abstract. The classification of rare cases is a challenging problem in many real 
life applications. The scarcity of the rare cases makes it difficult for traditional 
classifiers to classify them correctly. In this paper, we propose a new approach 
to use emerging patterns (EPs) [3] in rare-class classification (EPRC). 
Traditional EP-based classifiers [2] fail to achieve acceptable results when 
dealing with rare cases. EPRC overcomes this problem by applying three 
improving stages: generating new undiscovered EPs for the rare class, pruning 
low-support EPs, and increasing the supports of the rare-class EPs. An 
experimental evaluation carried out on a number of rare-class databases shows 
that EPRC outperforms EP-based classifiers as well as other classification 
methods such as PNrule [1], Metacost [6], and C4.5 [7]. 



1 Introduction 

Classification of rare cases is an important problem in data mining. This problem is 
identified as distinguishing rarely-occurring samples from other overwhelming 
samples in a significantly imbalanced dataset [1]. In this paper, we investigate how to 
employ emerging patterns (EPs) in rare-case classification. EPs are a new kind of 
pattern that was introduced recently [3]. EPs are defined as itemsets whose supports 
increase significantly from one class to another. The power of EPs can be used to 
build high-performance classifiers [2]. Usually, these classifiers achieve higher 
accuracies than other state-of-the-art classifiers. However, simple EP-based classifiers 
do not retain their high performance when dealing with datasets which have rare 
cases. The reason for this failure is that the number of the rare-class EPs is very small, 
and their supports are very low. Hence, they fail to distinguish rare cases from a vast 
majority of other cases. 

In this paper we propose a new approach to use the advantage of EPs to classify 
rare cases in imbalanced datasets. The aim of our approach (called EPRC) is to 
improve the discriminating power of EPs so that they achieve better results when 
dealing with rare cases. This is achieved through three improving stages: 1) 
generating new undiscovered EPs for the rare class, 2) pruning the low-support EPs, 
and 3) increasing the support of rare-class EPs. These stages are detailed in Section 3. 
In this paper we adopt the weighted accuracy [8] and the f-measure [9], as they are 
well-known metrics for measuring the performance of rare-case classification. The 
f-measure depends on the recall and precision of the rare class. 



2 Emerging Patterns 

Let $obj = (a_1, a_2, a_3, \ldots, a_n)$ be a data object following the schema 
$(A_1, A_2, A_3, \ldots, A_n)$. $A_1, A_2, A_3, \ldots, A_n$ are called attributes, and 
$a_1, a_2, a_3, \ldots, a_n$ are values related to these attributes. We call each pair 
(attribute, value) an item. 

Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I. 
We say an instance Y contains an itemset X if $X \subseteq Y$. 

Definition 1. Given a dataset D, and an itemset X, the support of X in D is defined as 
the percentage of the instances in D that contain X. 

Definition 2. Given two different classes of datasets D1 and D2, the growth rate of 
an itemset X from D1 to D2 is defined as the ratio between the support of X in D2 and 
its support in D1. 

Definition 3. Given a growth rate threshold $\rho > 1$, an itemset X is said to be 
a $\rho$-emerging pattern ($\rho$-EP or simply EP) from D1 to D2 if 
$GrowthRate(X) \ge \rho$. 
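
Definitions 1 to 3 translate directly into code. The sketch below is illustrative only: instances and itemsets are modelled as sets of item strings, support is returned as a fraction rather than a percentage, and the handling of zero support in D1 (growth rate 0 when both supports are zero, infinity otherwise) is a common convention assumed here rather than something stated in the text. All function names are hypothetical.

```python
from typing import FrozenSet, List, Set

Itemset = FrozenSet[str]


def support(itemset: Itemset, dataset: List[Set[str]]) -> float:
    """Fraction of instances in the dataset that contain the itemset."""
    if not dataset:
        return 0.0
    return sum(1 for inst in dataset if itemset <= inst) / len(dataset)


def growth_rate(itemset: Itemset, d1: List[Set[str]], d2: List[Set[str]]) -> float:
    """Ratio of the support in D2 to the support in D1."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0.0:
        # assumed convention: 0/0 -> 0, positive/0 -> infinity
        return float("inf") if s2 > 0.0 else 0.0
    return s2 / s1


def is_emerging_pattern(itemset: Itemset, d1: List[Set[str]],
                        d2: List[Set[str]], rho: float) -> bool:
    """True if the itemset is a rho-emerging pattern from D1 to D2."""
    return growth_rate(itemset, d1, d2) >= rho
```

For instance, an item that appears in one of four D1 instances and three of four D2 instances has growth rate 3 and is an EP for any threshold up to 3.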



3 Improving Emerging Patterns 

As described earlier, our approach aims at using the discriminating power of EPs in 
rare-case classification. We introduce the idea of generating new EPs for the rare 
class. Moreover, we adopt eliminating low-support EPs in both the major and rare 
classes, and increasing the support of rare-class EPs. 

The first step in our approach involves generating new rare-class EPs. Given a 
training dataset and a set of the discovered EPs, the values that have the highest 
growth rates from the major class to the rare class are found. The new EPs are 
generated by replacing different attribute values (in the original rare-class EPs) with 
the highest-growth-rate values. After that, the new EPs that already exist in the 
original set of EPs are filtered out. Figure 1 shows an example of this process. The 
left table shows four rare-class EPs. Suppose that the values that have the highest 
growth rates for attributes $A_1$ and $A_3$ are $V_{11}$ and $V_{33}$ respectively. Using 
these two values and EP $e_4$, $\{V_{13}, X, V_{34}, V_{55}\}$, we can generate 2 more EPs 
(in the right table). The first EP is $\{V_{11}, X, V_{34}, V_{55}\}$ (by replacing $V_{13}$ 
with $V_{11}$). The second EP is $\{V_{13}, X, V_{33}, V_{55}\}$ (by replacing $V_{34}$ with 
$V_{33}$). However, the first new EP already exists in the original set of EPs ($e_1$). 
This EP is filtered out. We argue that these newly generated EPs have a strong power to 
discriminate rare-class instances from major-class instances. There are two reasons for 
this argument. The first reason is that these new EPs are inherited from the original 
rare-class EPs, which themselves have a discriminating power to classify rare cases. 
The second reason is that they contain the most discriminating attribute values 
(attribute values with the highest growth rates) obtained from the training dataset. 



Fig. 1. Example of generating new rare-class EPs 

* Vij = value j for attribute i 

* X = undefined value 



Based on the above explanation, we have algorithm 1 to generate new rare-class 
EPs. 

Algorithm 1 (Generating new rare-class EPs) 

Input: the training dataset D, and discovered EPs E. 
Output: a set of new rare-class EPs. 

Method: 

For each attribute i in D 
    Ai = value with the highest growth rate of attribute i. 
    For each rare-class EP e 
        For each attribute value k related to i 
            If k != Ai 
                Generate a new EP e_new = e 
                Replace k by Ai in e_new 
                If e_new does not exist in E 
                    Add e_new to the set of new rare-class EPs 
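
Algorithm 1 can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: an EP is modelled as a tuple with one slot per attribute, `None` stands for the undefined value X, the highest-growth-rate value of each attribute is assumed to be precomputed and passed in as `best_value`, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

EP = Tuple[Optional[str], ...]  # one slot per attribute; None marks an undefined value


def generate_new_eps(rare_eps: List[EP], best_value: Dict[int, str],
                     known: Set[EP]) -> List[EP]:
    """Replace attribute values by the highest-growth-rate value and keep unseen EPs."""
    new_eps: List[EP] = []
    for i, best in best_value.items():
        for ep in rare_eps:
            k = ep[i]
            if k is None or k == best:
                continue  # nothing to replace for this attribute
            candidate = ep[:i] + (best,) + ep[i + 1:]
            if candidate not in known and candidate not in new_eps:
                new_eps.append(candidate)  # filter out EPs that already exist
    return new_eps
```

Mirroring the example of Fig. 1, replacing the first value of ("V13", None, "V34", "V55") by "V11" yields an EP that is already known and is therefore dropped, while replacing "V34" by "V33" yields a new one.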

The second step involves pruning the low-support EPs. This is performed for both 
the major and rare classes. Given a pruning threshold, the average growth rate of the 
attribute values in each EP is found. The EPs whose average growth rates are less 
than the given threshold are eliminated. We argue that these eliminated EPs have the 
least discriminating power as they contain many of the least-occurring values in the dataset. 

The third step involves increasing the support of rare-class EPs by a given 
percentage. The rationale is that this increase compensates for the effect 
of the large number of major-class EPs; that is, the overwhelming major-class EPs 
cause many rare-class instances to be classified as major-class. 
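
The second and third steps can be sketched as two small helpers. Both are illustrations under assumptions: the per-value growth rates are taken as a precomputed dictionary, an EP is a tuple of defined attribute values, and the support increase is interpreted as a relative (multiplicative) increase by the given percentage. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def prune_by_average_growth_rate(eps: List[Tuple[str, ...]],
                                 value_growth_rate: Dict[str, float],
                                 threshold: float) -> List[Tuple[str, ...]]:
    """Keep only the EPs whose attribute values have an average growth rate >= threshold."""
    kept = []
    for ep in eps:
        rates = [value_growth_rate[v] for v in ep]
        if rates and sum(rates) / len(rates) >= threshold:
            kept.append(ep)
    return kept


def boost_support(support: float, percentage: float) -> float:
    """Increase a rare-class EP's support by the given percentage (assumed relative)."""
    return support * (1.0 + percentage / 100.0)
```

With value growth rates a = 4, b = 2, c = 1, d = 1 and a threshold of 2, the EP (a, b) has average 3 and is kept, while (c, d) has average 1 and is pruned.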



4 Experimental Evaluation 

To investigate the performance of our approach, we carry out a number of 
experiments. We use three challenging rare-class datasets with different 
distributions of data between the major and rare classes: the insurance dataset [5], 
the disease dataset [4], and the sick dataset [4]. We compare our approach with 
other methods, namely PNrule, Metacost, C4.5, and CEP. The comparison is based 
on the weighted accuracy, traditional accuracy, major-class accuracy, recall 
(rare-class accuracy), precision, and F-measure. 



4.1 Tuning 

Our approach uses three parameters: the threshold for pruning the major-class EPs, 
the threshold for pruning the rare-class EPs, and the percentage by which the 
support of rare-class EPs is increased. These parameters need to be tuned to 
achieve the best results. To this end, we split the training set into two 
partitions. The first partition (70%) is used to train the classifier, and the 
second partition (30%) is used to tune the parameters. 



4.2 Comparative Results 

After tuning on the insurance dataset, the parameters of our approach are fixed for 
this dataset. We then run different methods on the test set of this dataset. These 
methods include EPRC (our approach), PNrule [1], C4.5 [7], Metacost [6], and CEP 
(EP-based classifier) [2]. The results of these methods are presented in Table 1. Our 
approach outperforms all other methods in terms of the weighted accuracy and 
F-measure. This performance is achieved by balancing the recall and the precision. 



Table 1. The results of the insurance dataset 

Classifier  Weighted   Traditional  Major-class  Recall (rare-    Precision  F-measure 
            accuracy   accuracy     accuracy     class accuracy) 
EPRC        63.89%     80.57%       82.82%       44.95%           14.20%     21.59% 
PNrule      58.91%     87.12%       90.03%       26.89%           15.80%     19.90% 
C4.5        52.87%     90.95%       96.09%        9.66%           13.52%     11.27% 
Metacost    49.80%      5.95%        0.02%       99.57%            5.92%     11.18% 
CEP         50.85%     93.8%        99.60%        2.10%           25.00%      3.87% 
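
The F-measure column can be related to the recall and precision columns with a short check. The sketch below assumes the standard balanced F-measure (the harmonic mean of recall and precision), which the text does not define explicitly; the function name is illustrative.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall %, precision %) pairs taken from Table 1.
rows = {
    "EPRC": (44.95, 14.20),
    "PNrule": (26.89, 15.80),
    "C4.5": (9.66, 13.52),
    "Metacost": (99.57, 5.92),
    "CEP": (2.10, 25.00),
}
for name, (recall, precision) in rows.items():
    print(f"{name}: F-measure = {f_measure(recall, precision):.2f}%")
```

Under this assumption the recomputed values agree with the reported F-measures to within rounding, which illustrates why a very high recall with very low precision (Metacost) or the reverse (CEP) yields a low F-measure.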



4.3 The Effects on the EP-Based Classifier 

The basic aim of the work presented in this paper is to improve the performance 
of EP-based classifiers in rare-case classification. In this experiment we compare the 
results obtained for our three datasets using CEP (EP-based classifier) and EPRC. As 
stated in Section 2, EPRC uses CEP as its basic EP-based classifier. The parameters 
for the three datasets were tuned using 30% of the training set. Table 2 shows how 
our approach enhances the performance of the EP-based classifier. There are 
significant increases in the weighted accuracy and F-measure from CEP to EPRC. 

Table 2. The effect on the EP-based classifier 

Experiment                 Weighted accuracy  F-measure 
Insurance dataset (CEP)    50.85%             3.87% 
Insurance dataset (EPRC)   63.89%             21.59% 
Disease dataset (CEP)      49.94%             Undefined 
Disease dataset (EPRC)     65.07%             34.78% 
Sick dataset (CEP)         78.89%             70% 
Sick dataset (EPRC)        94.57%             79.71% 



5 Conclusions and Future Research 

In this paper, we propose a new EP-based approach to classify rare classes. Our 
approach, called EPRC, introduces the idea of generating new rare-class EPs. 
Moreover, it improves EPs by pruning low-support EPs and increasing the 
support of rare-class EPs. We empirically demonstrate how improving EPs enhances 
the performance of EP-based classifiers in rare-case classification problems. 
Moreover, our approach helps EP-based classifiers outperform other classifiers in 
such problems. The proposed approach opens several directions for further research. 
One possibility is to improve the performance of EP-based classifiers further by 
adding improvement stages that increase the discriminating power of EPs. 

Abstract. Pattern discovery in a long temporal event sequence is of 
great importance in many application domains. Most of the previous 
work focuses on identifying positive associations among time-stamped 
event types. In this paper, we introduce the problem of defining and 
discovering negative associations that, like positive rules, may also serve 
as a source of knowledge discovery. 

In general, an event-oriented pattern is a pattern that is associated with a 
selected type of event, called a target event. As a counterpart of previous 
research, we identify patterns that have a negative relationship with the 
target events. A set of criteria is defined to evaluate the interestingness 
of patterns associated with such negative relationships. In the process of 
counting the frequency of a pattern, we propose a new approach, called 
unique minimal occurrence, which guarantees that the Apriori property 
holds for all patterns in a long sequence. Based on the interestingness 
measures, algorithms are proposed to discover potentially interesting 
patterns for this negative rule problem. Finally, an experiment is conducted 
on a real application. 



1 Introduction 

In reality, a large number of events are recorded with temporal information (e.g., 
a timestamp). We call a sequence of such events a temporal event sequence. 
In some data mining domains, there is a need to investigate very long temporal 
event sequences. Substantial work [1, 2, 3, 4, 5] has been done on finding frequent 
patterns in a long temporal event sequence; however, most of this work treats every 
event type equally and does not consider the speciality of events. 

In many applications, not every type of event has equal impact on overall 
data analysis goals. One may be particularly interested in a specific type of event 
and event patterns related to such events. For example, in telecommunication 
network fault analysis, a variety of signals are recorded to form a long temporal 
event sequence. In such a sequence, we could be more interested in events 
representing a fault on the network. Naturally, it is unnecessary to find all 
frequent patterns in the sequence as done traditionally. We define events of special 
interest as target events, and patterns related to target events as event-oriented 
patterns. The temporal relationship between target events and event-oriented 
patterns can be specified by temporal constraints. 



H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 212-221, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

Let us consider the application of earthquake prediction. In order to find 
event patterns that may potentially lead to earthquakes, many types of events 
that may relate to earthquakes are recorded by independent time-controlled 
devices. A temporal event sequence is then formed by ordering these events based 
on their timestamps. To predict an earthquake, it would be ideal to find the event 
pattern (which is a natural phenomenon or a combination of phenomena) that 
happens frequently only in the periods prior to earthquakes. Conversely, it 
is also interesting to find patterns that happen frequently at all times except 
before earthquakes. Both types of patterns can contribute to earthquake prediction 
and also complement each other. 

In the above example, we regard events that represent earthquakes as target 
events. Considering the two types of patterns, we define the former as positive 
event-oriented patterns and the latter as negative event-oriented patterns. The 
temporal constraint would then be the length of the periods used for the prediction. 
The problem of finding positive event-oriented patterns has been studied in our 
previous work [5]. In this paper, we study its counterpart, finding negative 
event-oriented patterns in a long sequence. 

Now, we generalize our problem as follows. Let k types of events be given. 
A temporal event sequence is a list of events indexed by their timestamps. Let 
e be a selected event type called the target event type. Additionally, let the size 
of a time interval T be given. The problem is to find any frequent pattern P of events 
such that the occurrences of P are considerably restrained in T-sized intervals 
before events of type e. The word “restrained” indicates the negative relationship 
between the pattern and target events: the pattern P either rarely 
happens before target events or happens far less frequently before target events 
than in any other period. Naturally, the mining result of such a problem can be 
expressed as negative rules of the form $\neg P \xrightarrow{T} e$. Based on this 
rule format, we explain our problem in more detail. 



Fig. 1. A fragment of sequence 



- The temporal constraints are important to ensure that negative rules are 
  meaningful. While T specifies the temporal relationship between patterns and 
  target events, we introduce another interval size Tp as the temporal 
  constraint on the pattern itself. A pattern P with the temporal constraint Tp 
  requires that every occurrence of P happen within an interval of 
  size Tp. Figure 1 illustrates the dependencies between all introduced notions in 
  a fragment of a long sequence. A target event occurs at time $t_0$. T gives 
  the length of the period before the target event. Let $P^i$ represent the i-th 
  occurrence of pattern P (with temporal constraint Tp) in the sequence. For 
  each $P^i$, we should have $T_i \le T_p$, where $T_i$ is the time span of $P^i$. In the 
  above rule format, Tp is implicitly set with P. Note that both Tp and T are 
  given by domain experts. 

- Regarding the order among pattern elements, P could be an existence pattern 
  (i.e., a set of event types) or a sequential pattern (i.e., a list of event 
  types). In this paper, for simplicity of presentation, we only discuss existence 
  patterns. All discussions can be easily extended to sequential patterns. 

The challenge of mining negative rules is to define appropriate interestingness 
measures. Considering the format of negative rules, intuitively, we need to find 
patterns that 1) occur frequently in a long sequence, and 2) occur rarely 
before target events. To link those two numbers of occurrences, we define a 
key interestingness measure based on unexpectedness [6, 2]. That is, a pattern is 
interesting because it is unexpected with respect to prior knowledge. According to 
the first requirement, all potentially interesting patterns should be frequent in the 
entire sequence. For any such pattern P, we assume that P is distributed uniformly 
over the whole period of the sequence S. Based on this important assumption, 
we can estimate the number of occurrences of P in the periods prior to target 
events. If the real number of occurrences of P is significantly smaller than our 
expectation, the pattern P is said to be unexpected and is, therefore, interesting 
to our problem. According to this idea, we propose the third requirement, i.e., 
3) the number of occurrences of P before target events is unexpected. Now, 
negative event-oriented patterns can be described as patterns that should happen 
frequently before the target events but actually do not. 

To evaluate the frequency of a pattern P, we need to count the number of 
occurrences of P in a long sequence S. Based on the concept of minimal 
occurrence (MO) [7], we propose a new approach called unique minimal occurrence 
(UMO), which guarantees that the Apriori property holds for all patterns in a 
long sequence. 

The rest of this paper is organized as follows. Related work is discussed in 
Section 2. In Section 3, the problem of finding negative event-oriented patterns 
is formulated and further discussed. In Section 4, we propose algorithms to find 
a complete set of negative patterns. Section 5 shows the experimental results of a 
real application. Finally, we conclude this paper in Section 6. 

2 Related Work 

In [8, 7, 1], Mannila et al. have studied how to find episodes (equivalent to patterns 
in our paper) in a long temporal event sequence. There are a few differences 
between our work and theirs. First, we introduce the speciality of events and only 
find the patterns related to a selected type of event. Second, while their work 
focuses on finding positive relationships between events, we aim to identify 
negative relationships. Last but not least, based on their definition of minimal 
occurrence, we propose the UMO approach to count the number of occurrences 
of a pattern in a long sequence, as will be discussed in detail in Section 3. 

Mining negative association rules has been studied in [9, 10]. The fundamental 
difference between our work and theirs is that we discover negative rules 
from a long temporal event sequence rather than a set of transactions. In [10], 
Wu et al. have defined the interestingness measures using probability theory 
and provided an algorithm to mine both positive and negative association rules 
simultaneously. However, their interestingness measures cannot deal with our 
problem because in a long sequence, it is impractical to count how many times 
a pattern does not happen. In [9], Savasere et al. have defined interestingness 
measures using unexpectedness. They estimated the expected support of rules 
by applying domain knowledge, which is a taxonomy on items. A rule is interesting 
if its actual support is significantly smaller than the expected value. In our 
problem, we estimate the expected support value based on the assumption that 
frequent patterns are uniformly distributed in the sequence. Hence, our work is 
independent of domain knowledge. 

Dong and Li [11] have worked on the problem of discovering emerging patterns 
(EPs) from two datasets. A pattern is defined as an EP if its support 
increases significantly from one dataset to another. Formally, P is an EP if 
$\frac{Supp_{D_2}(P)}{Supp_{D_1}(P)} \ge \rho$, where $D_1$ and $D_2$ are two different 
datasets and $\rho$ is a threshold. This is similar to one of our interestingness 
measures; however, in our problem, one dataset is a long sequence and the other 
is a set of sequence fragments. Accordingly, we define different interestingness 
measures for the pattern in each dataset and define the interestingness based on 
unexpectedness. 

3 Problem Statement 

3.1 Negative Event-Oriented Patterns 

Let us consider a finite set E of event types. An event is a pair (a, t) where 
$a \in E$ and t is the timestamp of the event. We define one special event type e 
as the target event type. Any event of type e is called a target event. A temporal 
event sequence, or sequence for short, is a list of events totally ordered by their 
timestamps. It is formally represented as 
$S = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_n, t_n) \rangle$ 
where $a_i \in E$ for $1 \le i \le n$ and $t_i < t_{i+1}$ for $1 \le i \le n-1$. The duration of 
sequence S is the time span of S, namely, $Dur(S) = t_n - t_1$. 

An existence pattern (in short, pattern) P with temporal constraint Tp is 
defined as follows: 1) $P \subseteq E$, and 2) events matching pattern P must occur 
within a time interval of size Tp. The length of P is defined as the number of 
elements in P. A pattern P of length l is called an l-pattern. 

A negative rule is of the form $r = \neg P \xrightarrow{T} e$, where e is the target 
event type and P is a negative event-oriented pattern implicitly with the temporal 
constraint Tp. T is the length of the interval that specifies the temporal relationship 
between P and e. Generally, r can be interpreted as: the occurrence of P is 
unexpectedly rare in T-sized intervals before target events. 

3.2 Interestingness Measures 



According to the rule format, the right-hand side of a rule is always the target 
event type e, so the key problem is to find the negative event-oriented pattern P. 
Recall that we want P to satisfy the following three requirements: P 
1) is frequent in S, 2) is infrequent before target events, and 3) occurs far less 
frequently than expected before target events. In the following, we 
discuss how to define a set of interestingness measures for these requirements. 

First, let us suppose there exists a method to return the number of 
occurrences of a pattern P in a sequence S, denoted as Count(P, S). 

Definition 3.1. Given a pattern P, a sequence S, and the size of time interval 
T, the global support of P in S is defined as $Supp(P, S) = \frac{Count(P, S)}{N}$, where 
$N = \left\lceil \frac{Dur(S)}{T} \right\rceil$. 

From the above, we know that the global support of pattern P reflects the 
frequency of P in S. Note that, under the assumption of uniform distribution, it 
also gives the expected number of occurrences of P in a T-sized interval. 
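
As a small illustration of Definition 3.1, the sketch below computes the global support from an occurrence count. It assumes that a sequence is a list of (event type, timestamp) pairs sorted by timestamp and that N is Dur(S)/T rounded up to a whole number of intervals; the function names are illustrative, and the occurrence count is taken as given.

```python
import math


def duration(events):
    """Dur(S): time span between the first and the last event."""
    return events[-1][1] - events[0][1]


def global_support(count, events, T):
    """Supp(P, S) = Count(P, S) / N, with N the number of T-sized
    intervals covering S (rounding up is an assumption of this sketch)."""
    n_intervals = math.ceil(duration(events) / T)
    return count / n_intervals


# Toy sequence: nine events between timestamps 1 and 13.
S = [("b", 1), ("c", 2), ("a", 3), ("d", 5), ("b", 6),
     ("b", 8), ("c", 10), ("b", 12), ("a", 13)]
supp = global_support(count=2, events=S, T=4)   # Dur(S) = 12, N = 3
```

Under the uniform-distribution assumption, `supp` is the number of occurrences of the pattern one would expect to see in any single T-sized interval.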

Considering requirement 2), we first formally define the dataset before target 
events and then give the interestingness measure. 

Let a window w be $[t_s, t_e)$, where $t_s$ and $t_e$ are the start time and the end time 
of w respectively. The window size of w is the time span of the window, denoted 
as $Size(w) = t_e - t_s$. A sequence fragment $f(S, w)$ is the part of sequence S 
determined by w. Given a target event type e and a sequence S, we can locate all 
target events by creating a timestamp set $T^e = \{t \mid (e, t) \in S\}$. For each 
$t_i \in T^e$, we create a T-sized window $w_i = [t_i - T, t_i)$ and get the sequence 
fragment $f_i = f(S, w_i)$. The set of these sequence fragments 
$D = \{f_1, f_2, \ldots, f_m\}$ is called the local dataset of target event type e. 
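
The construction of the local dataset can be sketched as follows, again assuming a sequence is a list of (event type, timestamp) pairs. The function name `local_dataset` and the toy data are hypothetical, and the linear scan per target event is chosen for clarity rather than efficiency.

```python
def local_dataset(events, target, T):
    """Return one sequence fragment per target event: the events whose
    timestamps fall in the half-open window [t_i - T, t_i)."""
    fragments = []
    for etype, t in events:
        if etype == target:
            fragments.append([(a, u) for a, u in events if t - T <= u < t])
    return fragments


S = [("a", 1), ("b", 2), ("e", 3), ("c", 4), ("e", 6)]
D = local_dataset(S, target="e", T=2)
```

Because the window is half-open, the target event itself is excluded from its own fragment, although an earlier target event may appear inside a later fragment.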



Fig. 2. An Example of local dataset 



Example 3.1. Figure 2 presents an event sequence S. Suppose e is the target 
event type. The timestamp set of e is $\{t_5, t_8, t_{10}\}$. $w_1$, $w_2$ and $w_3$ are three 
T-sized windows ending at timestamps $t_5$, $t_8$ and $t_{10}$ respectively. The sequence 
fragments $f_1$, $f_2$, and $f_3$ consist of the events of S that fall within these three 
windows (see Figure 2). The local dataset D is $\{f_1, f_2, f_3\}$. 

Definition 3.2. Given a pattern P and a local dataset D, the local support of 
P in D is defined as $Supp(P, D) = \frac{\sum_{i=1}^{|D|} Count(P, f_i)}{|D|}$, where 
$Count(P, f_i)$ returns the number of occurrences of P in $f_i$. 

As every sequence fragment is identified by a window of size T, the local 
support is the average number of occurrences of P in a T-sized interval before 
target events. Recall that the global support is the expected number of occurrences 
of P in a T-sized interval. Naturally, we can set the ratio between them as the 
interestingness measure for requirement 3). 

Definition 3.3. Given the global support $Supp(P, S)$ and the local support 
$Supp(P, D)$ of a pattern P, the comparison ratio of P is defined as 

$$Cr(P) = \begin{cases} 1, & Supp(P, S) = 0 \wedge Supp(P, D) = 0 \\ \infty, & Supp(P, S) \ne 0 \wedge Supp(P, D) = 0 \\ \dfrac{Supp(P, S)}{Supp(P, D)}, & \text{otherwise} \end{cases}$$ 
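
Definitions 3.2 and 3.3 translate directly into a few lines of code. In the sketch below the per-fragment occurrence counts are assumed to be available already, and the function names are illustrative.

```python
import math


def local_support(counts_per_fragment):
    """Supp(P, D): average number of occurrences of P per fragment.
    Assumes the local dataset is non-empty."""
    return sum(counts_per_fragment) / len(counts_per_fragment)


def comparison_ratio(global_supp, local_supp):
    """Cr(P) as in Definition 3.3, including the two zero-support cases."""
    if local_supp == 0:
        return 1.0 if global_supp == 0 else math.inf
    return global_supp / local_supp


cr = comparison_ratio(global_supp=5.73, local_supp=1.86)
```

A ratio well above 1 means that the pattern occurs before target events much less often than the uniform-distribution assumption predicts.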

3.3 Problem Definition 

The formal definition of finding negative event-oriented patterns is: given a 
sequence S, a target event type e, two sizes of interval T and Tp, and three 
thresholds $s_g$, $s_l$, and $c_r$ ($c_r > 1$), find the complete set of rules 
$r = \neg P \xrightarrow{T} e$ such that 1) $Supp(P, S) \ge s_g$, 2) $Supp(P, D) \le s_l$, 
and 3) $Cr(P) \ge c_r$. 

The first condition guarantees that the pattern has statistical significance. 
While condition 2) requires the absolute sparsity of the pattern before target 
events, condition 3) is a constraint of relative sparsity. Note that whether the 
constraint of absolute sparsity is required or not depends on the application 
domain. 

From Definition 3.3, we can see that the three interestingness measures are 
not independent. Due to space limitations, we omit the discussion of the 
relationship among $s_g$, $s_l$, and $c_r$. 

3.4 Counting the Number of Occurrences 

In this section, we discuss how to count the number of occurrences of a pattern in 
a sequence or a sequence fragment (i.e., how to define the method Count(P, S) or 
Count(P, $f_i$)). Intuitively, given a pattern P, every occurrence of P corresponds 
to a window $[t_s, t_e]$. (For brevity, in this section, we do not consider the temporal 
constraint $t_e - t_s \le T_p$.) 

Mannila et al. [7] have proposed the concept of minimal occurrence (MO) to 
represent that a pattern occurs in a sequence. Using our model, their definition 
can be presented as: a window w is a minimal occurrence of a pattern P iff 
1) $f(S, w)$ contains P, and 2) there does not exist a subwindow $w' \subset w$ (i.e., 
$t_s \le t'_s$, $t_e \ge t'_e$, and $Size(w) \ne Size(w')$) such that $f(S, w')$ contains P. 



Fig. 3. An example of sequence (events at timestamps 1, 2, 3, 5, 6, 8, 10, 12, and 13) 



Example 3.3. A sequence S is visualized in Figure 3. Let P be an existence 
pattern {b, c}. The MOs of P in S are [1, 2], [2, 6], [8, 10], and [10, 12]. 

An important observation is that the Apriori property does not hold in the 
MO approach. That is, in a sequence S, the number of MOs of a pattern in 
S may be less than that of its superpattern. For instance, in Example 3.3, the 
pattern {c} happens twice; however, in terms of MO, {b, c} happens 4 times. 

To rectify this shortcoming, we propose a new approach called unique minimal 
occurrence (UMO). The word unique here means that an event can be used to 
match one pattern at most once. For instance, in Example 3.3, because the event 
(c, 2) has matched P ({b, c}) in the window [1, 2], in the UMO approach, we do 
not consider it any more in the subsequent matching of P. Thus, [2, 6] is not a 
UMO of P. According to this idea, the UMOs of P in S are [1, 2] and [8, 10]. We 
give the formal definition of the UMO as follows. 

Given a sequence S, let M(S, P) be a method to find the first MO of P in 
S. In addition, let R(S, t) express the part of S after the timestamp t (i.e., the 
part of S in $(t, t_n]$, where $t_n$ is the timestamp of the last event in S). 

Definition 3.5. Given a sequence, represented as $S_0$, the unique minimal 
occurrence of P is defined as: 1) $w_0$ is a unique minimal occurrence if 
$w_0 = M(S_0, P) \ne null$; 2) $w_i$ ($i \ge 1$) is a unique minimal occurrence if 
$w_i = M(S_i, P) \ne null$, where $S_i = R(S_{i-1}, w_{i-1}.t_e)$. 
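
Definition 3.5 can be illustrated with a single left-to-right scan: the first MO ends at the first event at which every event type of P has been seen, and the search then restarts strictly after that event. The sketch below ignores the temporal constraint Tp, as this section does. The positions of b and c in the toy sequence are inferred from the MOs listed in Example 3.3, while the event types at timestamps 3, 5 and 13 are not recoverable from the figure and are hypothetical placeholders.

```python
def unique_minimal_occurrences(events, pattern):
    """Return the UMOs of an existence pattern as (start, end) windows.

    events: list of (event_type, timestamp) pairs sorted by timestamp.
    pattern: set of event types. The temporal constraint Tp is ignored.
    """
    last_seen = {}   # event type -> latest timestamp since the last UMO
    umos = []
    for etype, t in events:
        if etype not in pattern:
            continue
        last_seen[etype] = t
        if len(last_seen) == len(pattern):
            # All types matched: [oldest latest-match, t] is the first MO.
            umos.append((min(last_seen.values()), t))
            last_seen = {}   # restart the search after this UMO
    return umos


# b and c follow Example 3.3; "a" and "d" are placeholder event types.
S = [("b", 1), ("c", 2), ("a", 3), ("d", 5), ("b", 6),
     ("b", 8), ("c", 10), ("b", 12), ("a", 13)]
umos_bc = unique_minimal_occurrences(S, {"b", "c"})
umos_c = unique_minimal_occurrences(S, {"c"})
```

On this sequence the scan yields the windows [1, 2] and [8, 10] for {b, c} and two occurrences for {c}, so the count of the superpattern does not exceed that of the subpattern.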

Claim 3.1. The number of unique minimal occurrences of an existence pattern 
is no less than that of its superpattern. 

Proof. Let P and P' be two existence patterns such that $P \subset P'$. For any UMO 
w' of P', one of the following two conditions holds: 1) there exists a subwindow 
$w \subseteq w'$ such that w is a UMO of P; 2) there exists a window w'' such that w'' is 
a UMO of P and $w' \cap w'' \ne \emptyset$. For any two UMOs of P', denoted as $w'_i$ and $w'_j$ 
where $i \ne j$, $w'_i \cap w'_j = \emptyset$ always holds. Therefore, the number of UMOs of P' 
cannot exceed that of P. □ 

4 Searching for Interesting Patterns 

In this section, we propose two algorithms to find a complete set of interesting 
patterns. Algorithm 1 is the main algorithm, and Algorithm 2 performs a key 
subtask: finding UMOs for a set of patterns. 

Given a target event type and a set of parameters, Algorithm 1 finds a collection 
of interesting negative event-oriented patterns from a long sequence. From 
Figure 4, we can see that the algorithm consists of three phases. In the searching 
phase (phase 1), we discover patterns whose global support is no less than $s_g$. In 
phase 2, called the testing phase, for every pattern discovered in the searching phase, 
we compute the local support in the dataset before target events. Finally, patterns 
are pruned based on the thresholds of the interestingness measures in the pruning phase. 
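
The text does not spell out how candidate k-patterns are generated in the searching phase. One common choice, shown here purely as an assumption, is the Apriori-style join: a k-pattern is a candidate only if every one of its (k-1)-subpatterns is frequent, which is justified for UMO counts by Claim 3.1. The function name is illustrative.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Apriori-style candidate generation (an assumed strategy):
    join frequent (k-1)-patterns and keep a k-pattern only if all of
    its (k-1)-subsets are frequent."""
    prev = {frozenset(p) for p in prev_frequent}
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


F2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
C3 = generate_candidates(F2, 3)
```

Here {a, b, c} survives because all three of its 2-subpatterns are frequent, while {a, b, d} and {b, c, d} are discarded without counting because {a, d} and {c, d} are not.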

Algorithm 2 (shown in Figure 5) finds the number of UMOs for a set of 
patterns in a sequence (or a sequence fragment). To avoid unnecessary checks, we 
first group patterns according to their event types. For example, group(a) is a 
pattern set that includes all patterns containing event type a. For each pattern 
P, we use count to record the number of UMOs and event_count to record how 
many event types in P have been matched. P.timestamp(a) is the timestamp 

Algorithm 1 

Input: A sequence S, a target event type e, two sizes of 
       interval T and Tp, and four thresholds s_g, s_l, l_r and c_r. 
Output: A complete set of interesting patterns. 
Method: 

/* Searching phase */ 
F_1 = {frequent 1-patterns}; 
for (k = 2; F_{k-1} ≠ ∅; k++) do 
    C_k = candidate k-patterns generated from F_{k-1}; 
    for each P_i ∈ C_k do    /* Algorithm 2 */ 
        Compute Supp(P_i, S); 
    F_k = {P_i ∈ C_k | Supp(P_i, S) ≥ s_g}; 
Frequent pattern set F = ∪_k F_k; 

/* Testing phase */ 
for each (e, t_i) in S do 
    Create w_i = [t_i - T, t_i) and f_i = f(S, w_i); 
    for each P_j ∈ F do 
        Compute the number of UMOs of P_j in f_i; 
        Update Supp(P_j, D) and Lr(P_j, D); 

/* Pruning phase */ 
for each P_i ∈ F do 
    if ((Supp(P_i, D) ≤ s_l and Cr(P_i) ≥ c_r) 
        or Lr(P_i, D) > l_r) then 
        Output P_i; 

Fig. 4. Main algorithm 



Algorithm 2 

Input: A set C of patterns, a sequence (or sequence 
       fragment) S and the size of interval Tp. 
Output: the number of UMOs for each pattern. 
Method: 

/* Initialization */ 
for each P_i ∈ C do 
    P_i.event_count = 0; P_i.count = 0; 
    for each a_j ∈ P_i do 
        P_i.timestamp(a_j) = Null; 
        group(a_j) = group(a_j) ∪ {P_i}; 

/* Data pass */ 
for each (a_i, t_i) in S such that i from 1 to n do 
    for each P_j ∈ group(a_i) do 
        if (P_j.timestamp(a_i) == Null) then 
            P_j.timestamp(a_i) = t_i; 
            if (++P_j.event_count == |P_j|) then 
                /* Check temporal constraint */ 
                for each a_k ∈ P_j do 
                    if (t_i - P_j.timestamp(a_k) > Tp) then 
                        P_j.timestamp(a_k) = Null; 
                        P_j.event_count--; 
                if (P_j.event_count == |P_j|) then 
                    /* a UMO found */ 
                    P_j.count++; P_j.event_count = 0; 
                    for each a_k ∈ P_j do 
                        P_j.timestamp(a_k) = Null; 
        else /* Update timestamp */ 
            P_j.timestamp(a_i) = t_i; 

Fig. 5. Counting the number of UMOs 



of the event which matches the event type a in pattern P. When P.event_count 
is equal to |P| (i.e., all event types in P have been matched), we check whether 
the temporal constraint (specified by Tp) is satisfied or not. For any event that 
matches P but violates the constraint, we eliminate it by setting its timestamp 
to Null and decreasing P.event_count by 1. If no such events exist (i.e., a UMO 
has been found), P.count is increased by 1. After finding a UMO of P, we need 
to set P.event_count = 0 and clear all P.timestamp(a). 
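
The counting procedure described above can be sketched in Python as follows. This is an illustrative reading of Figure 5 rather than the authors' code: patterns are assumed to be sets of event types, the sequence a list of (event type, timestamp) pairs sorted by timestamp, and the per-pattern dictionary `stamps[p]` plays the role of P.timestamp, with its size standing in for P.event_count.

```python
from collections import defaultdict


def count_umos(patterns, events, tp):
    """Count the UMOs of each pattern under temporal constraint tp."""
    patterns = [frozenset(p) for p in patterns]
    count = {p: 0 for p in patterns}
    stamps = {p: {} for p in patterns}      # matched type -> timestamp

    # Group patterns by the event types they contain.
    group = defaultdict(list)
    for p in patterns:
        for a in p:
            group[a].append(p)

    for a, t in events:
        for p in group.get(a, ()):
            seen = stamps[p]
            newly_matched = a not in seen
            seen[a] = t                      # set or update the timestamp
            if newly_matched and len(seen) == len(p):
                # All types matched: drop matches that violate tp.
                for k in [k for k, tk in seen.items() if t - tk > tp]:
                    del seen[k]
                if len(seen) == len(p):      # a UMO has been found
                    count[p] += 1
                    seen.clear()
    return count


# b and c follow Example 3.3; "a" and "d" are placeholder event types.
S = [("b", 1), ("c", 2), ("a", 3), ("d", 5), ("b", 6),
     ("b", 8), ("c", 10), ("b", 12), ("a", 13)]
bc, c = frozenset({"b", "c"}), frozenset({"c"})
loose = count_umos([bc, c], S, tp=5)
tight = count_umos([bc, c], S, tp=1)
```

With tp = 5 both UMOs of {b, c} from Example 3.3 are counted; with tp = 1 only the window [1, 2] satisfies the constraint, so the count drops to one. A single pass over the sequence serves all patterns at once, which is why grouping by event type matters.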



5 Experiment Results 

In this section, we show experimental results from the application of 
telecommunication network fault analysis. In a telecommunication network, various 
devices are used to monitor the network continuously, generating a large number of 
different types of events. One type of event, trouble report (TR), is of particular 
importance because it indicates the occurrence of errors on the network. To investigate 
the dependency between TRs and other types of events, all events collected by 
different devices from the network are integrated into a long sequence ordered by 
their timestamps. In such a sequence, we regard TR as the target event and 
try to find patterns that have a negative relationship with TR. 

The telecommunication event database contains 119,814 events, covering 190 
event types. The population of target events is 2,234. In the experiment, we do 
not consider the constraint on the local support (i.e., we disregard the condition 
$Supp(P, D) \le s_l$) and always set the threshold $c_r$ to 2.0. Negative patterns are 

Table 1. An example rule 

Rule                  Interestingness measures 
P: {E1, F5}           Supp(P, S): 5.73 
e: NTN                Supp(P, D): 1.86 
Tp (minute): 10       Cr(P): 3.08 
T (minute): 180 

discovered under different values of other parameters. An example rule is shown 
in Table 1. The rule indicates that occurrences of signals E1 and F5 together 
within 10 minutes are unexpectedly rare in the 180-minute interval before 
trouble report NTN. According to this rule, domain experts believe that the 
signals E1 and F5 have a negative relationship with NTN. 




Fig. 6. Performance evaluation: (a) performance of phase 1 with respect to the 
minimum number of UMOs (c_g), Tp = 20m; (b) performance of phase 1; 
(c) performance of phase 2 with respect to T, Tp = 20m, c_g = 2000; (d) scalability 
of Algorithm 1 with respect to the number of events (thousand), Tp = 20m, 
T = 180m, s_g = 1.0 



Performance evaluation can be seen in Figure 6. In particular, Figure 6(a) 
gives the performance curve of phase 1 with respect to $c_g$, where $c_g$ is a threshold 
on the number of UMOs, equivalent to $s_g \cdot \left\lceil \frac{Dur(S)}{T} \right\rceil$. 
Figure 6(b) shows the effect of parameter Tp on the performance of phase 1. In 
phase 2, a reverse scan of the database is performed. During the scan, we 
dynamically create the sequence fragments and update the local support for patterns 
discovered in phase 1. The performance of phase 2 is shown in Figure 6(c) in terms 
of T. Finally, Figure 6(d) illustrates the overall performance and the scalability of 
Algorithm 1. 

6 Conclusions 

Mining negative associations should be treated as being as important as mining 
positive associations. However, as far as we know, it has been ignored in the research 
on pattern discovery in a long sequence. In this paper, the problem of finding 
negative event-oriented patterns has been identified. After proposing a set of 
interestingness measures, we have designed algorithms to discover a complete set 
of interesting patterns. To count the frequency of a pattern, we have proposed 
the UMO approach, which guarantees that the Apriori property holds for all 
patterns in the sequence. Finally, an experiment on a real application 
demonstrates the applicability of this research problem. 

In a long sequence, the relationship between a pattern and target events could 
be positive, negative, or independent. Using the comparison ratio, we can roughly 
classify the pattern space into three categories according to the relationship 
with target events. In future work, we will investigate mining both positive 
and negative patterns in one process. In such a problem, performance 
will be our main concern. 

Abstract. Outlier detection in large datasets is an important problem. 
There are several recent approaches that employ very reasonable 
definitions of an outlier. However, a fundamental issue is that the notion 
of which objects are outliers typically varies between users or, even, 
datasets. In this paper, we present a novel solution to this problem, by 
bringing users into the loop. Our OBE (Outlier By Example) system is, to 
the best of our knowledge, the first that allows users to give some 
examples of what they consider as outliers. Then, it can directly incorporate a 
small number of such examples to successfully discover the hidden 
concept and spot further objects that exhibit the same “outlier-ness” as the 
examples. We describe the key design decisions and algorithms in 
building such a system and demonstrate on both real and synthetic datasets 
that OBE can indeed discover outliers that match the users’ intentions. 



1 Introduction 

In many applications (e.g., fraud detection, financial analysis and health moni- 
toring), rare events and exceptions among large collections of objects are often 
more interesting than the common cases. Consequently, there is increasing at- 
tention on methods for discovering such “exceptional” objects in large datasets 
and several approaches have been proposed. 

However, the notion of what is an outlier (or, exceptional/ abnormal object) 
varies among users, problem domains and even datasets (problem instances): 
(i) different users may have different ideas of what constitutes an outlier, (ii) 
the same user may want to view a dataset from different “viewpoints” and, (iii) 
different datasets do not conform to specific, hard “rules” (if any). 

We consider objects that can be represented as multi-dimensional, numeri- 
cal tuples. Such datasets are prevalent in several applications. From a general 
perspective [4, 7, 8, 2], an object is, intuitively, an outlier if it is in some way “sig- 
nificantly different” from its “neighbors.” Different answers to what constitutes 
a “neighborhood,” how to determine “difference” and whether it is “significant,” 
would provide different sets of outliers. 

Typically, users are experts in their problem domain, not in outlier detection. 
However, they often have a few example outliers in hand, which may “describe” 
their intentions and they want to find more objects that exhibit “outlier-ness” 
characteristics similar to those examples. Existing systems do not provide a 
direct way to incorporate such examples in the discovery process. 

Example. We give a concrete example to help clarify the problem. The example 
is on a 2-d vector space which is easy to visualize, but ideally our method should 
work on arbitrary dimensionality or, even, metric datasets^. 

Consider the dataset in Figure 1. In this dataset, there are a large sparse clus- 
ter, a small dense cluster and some clearly isolated objects. Only the isolated 
objects (circle dots) are outliers from a “bird’s eye” view. In other words, when 
we examine wide-scale neighborhoods (i.e., with large radius — e.g., covering most 
of the dataset), only the isolated objects have very low neighbor densities, com- 
pared with objects in either the large or the small cluster. However, consider 
the objects on the fringe of the large cluster (diamond dots). These can also be 
regarded as outliers, if we look closer at mid-scale (i.e., radius) neighborhoods. 
Also, objects on the fringe of the small cluster (cross dots) become outliers, if 
we further focus into small-scale neighborhoods. As exemplified here, different 
objects may be regarded as outliers, depending on neighborhood scale (or, size). 

This scenario is intuitive from the users’ perspective. However, to the best of our 
knowledge, none of the existing methods can directly incorporate user examples 
in the discovery process to find out the “hidden” outlier concept that users may 
have in mind. 

In this paper, we propose Outlier By Example (OBE), an outlier detection 
method that can do precisely that: discover the desired “outlier-ness” at the 
appropriate scales, based on a small number of examples. There are several 
challenges in making this approach practical; we briefly list the most important: 
(1) What are the appropriate features that can capture “outlier-ness?” These 
should ideally capture the important characteristics concisely and be efficient 
to compute. However, feature selection is only the tip of the iceberg. (2) Fur- 
thermore, we have to carefully choose exactly what features to extract. (3) The 
method should clearly require minimal user input and effectively use a small 
number of positive examples in order to be practical. Furthermore, it should 
ideally not need negative examples. (4) Given these requirements, can we train 
a classifier using only the handful of positive examples and unlabeled data? In 
the paper we describe the key algorithmic challenges and design decisions in 
detail. 

In summary, the main contributions of this paper are: (1) We introduce 
example-based outlier detection. (2) We demonstrate its intuitiveness and fea- 
sibility. (3) We propose OBE, which, to the best of our knowledge, is the first 
method to provide a solution to this problem. (4) We evaluate OBE on both real 
and synthetic data, with several small sets of user examples. Our experiments 
demonstrate that OBE can successfully incorporate these examples in the discovery 
process and detect outliers with “outlier-ness” characteristics very similar to 
the given examples. 

^ A metric dataset consists of objects for which we only know the pairwise distances 
(or, “similarity”), without any further assumptions. 

The remainder of the paper is organized as follows: In Section 2, we discuss 
related work on outlier detection. In Section 3, we discuss the measurement of 
“outlier-ness” and the different properties of outliers. Section 4 presents the 
proposed method in detail. Section 5 reports the extensive experimental evaluation 
on both synthetic and real datasets. Finally, Section 6 concludes the paper. 



2 Related Work 



In essence, outlier detection techniques traditionally employ unsupervised learn- 
ing processes. The several existing approaches can be broadly classified into 
the following categories: (1) Distribution-based approach. These are the “clas- 
sical” methods in statistics [1,11]. (2) Depth-based approach. This computes 
different layers of k-d convex hulls and flags objects in the outer layer as out- 
liers [5]. (3) Clustering approach. Many clustering algorithms detect outliers as 
by-products [6]. (4) Distance-based approach. Distance-based outliers [7,8,9,10,3] 
use a definition based on a single, global criterion. All of the above approaches 
regard being an outlier as a binary property. They do not take into account 
both the degree of “outlier-ness” and where the “outlier-ness” is presented. (5) 
Density-based approach, proposed by M. Breunig, et al. [2]. They introduced a 
local outlier factor (LOF) for each object, indicating its degree of “outlier-ness.” 
LOF depends on the local density of its neighborhood. The neighborhood is de- 
fined by the distance to the MinPts-th nearest neighbor. When we change the 
value of the parameter MinPts, the degree of “outlier-ness” can be estimated in 
different scopes. However, LOF is very sensitive to the selection of MinPts values, 
and it has been proven that LOF cannot cope with the multi-granularity prob- 
lem. (6) LOCI. We proposed the multi-granularity deviation factor (MDEF) and 
LOCI in [12]. MDEF measures the “outlier-ness” of objects in neighborhoods of 
different scales. LOCI examines the MDEF values of objects in all ranges and 
flags as outliers those objects whose MDEF values deviate significantly from the 
local average in neighborhoods of some scales. So, even though the definition of 
MDEF can capture “outlier-ness” in different scales, these differences are up to 
the user to examine manually. 

Another outlier detection method was developed in [15], which focuses on 
the discovery of rules that characterize outliers, for the purposes of filtering new 
points in a security monitoring setting. This is a largely orthogonal problem. 
Outlier scores from SmartSifter are used to create labeled data, which are then 
used to find the outlier filtering rules. 

In summary, all the existing methods are designed to detect outliers based 
on some prescribed criteria for outliers. To the best of our knowledge, this is the 
first proposal for outlier detection using user-provided examples. 
Fig. 1. Illustration of different kinds of outliers in a dataset. 

Fig. 2. Illustrative dataset and MDEF plots. 



3 Measuring Outlier-ness 



In order to understand the users’ intentions and the “outlier-ness” they are in- 
terested in, a first, necessary step is measuring the “outlier-ness.” It is crucial 
to select features that capture the important characteristics concisely. However, 
feature selection is only the initial step. In OBE, we employ MDEF for this pur- 
pose, which measures “outlier-ness” of objects in the neighborhoods of different 
scales (i.e., radii). 

A detailed definition of the multi-granularity deviation factor (MDEF) is 
given in [12]. Here we describe some basic terms and notation. Let the 
r-neighborhood of an object $p_i$ be the set of objects within distance $r$ of $p_i$. 
Let $n(p_i, \alpha r)$ and $n(p_i, r)$ be the numbers of objects in the 
$\alpha r$-neighborhood (counting neighborhood) and r-neighborhood (sampling 
neighborhood) of $p_i$, respectively.^2 Let $\hat{n}(p_i, r, \alpha)$ be the average, 
over all objects $p$ in the r-neighborhood of $p_i$, of $n(p, \alpha r)$. 



Definition (MDEF). For any $p_i$, $r$ and $\alpha$, the multi-granularity deviation 
factor (MDEF) at radius (or scale) $r$ is defined as follows: 

$$\mathrm{MDEF}(p_i, r, \alpha) = \frac{\hat{n}(p_i, r, \alpha) - n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)} \qquad (1)$$ 



Intuitively, the MDEF at radius $r$ for a point $p_i$ is the relative deviation of 
its local neighborhood density from the average local neighborhood density in 
its r-neighborhood. Thus, an object whose neighborhood density matches the 
average local neighborhood density will have an MDEF of 0. In contrast, outliers 
will have MDEFs far from 0. In our paper, the MDEF values are examined (or, 
sampled) at a wide range of sampling radii $r$, $r_{min} \le r \le r_{max}$, where $r_{max}$ 
is the maximum distance of all object pairs in the given dataset and $r_{min}$ is 
determined based on the number of objects in the r-neighborhood of $p_i$. In our 
experiments, for each $p_i$ in the dataset, $r_{min}$ for $p_i$ (denoted by $r_{min,i}$) is 
the distance to its 20-th nearest neighbor. In other words, we do not examine the 
MDEF value of an object until the number of objects in its sampling neighborhood 
reaches 20. This is a reasonable choice which effectively avoids introduction of 
statistical errors in MDEF estimates in practice. 
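
The definition above can be illustrated with a small brute-force sketch. It is not the implementation used in the paper or in [12]: the function names (`neighbors`, `mdef`, `r_min`) are hypothetical, objects are assumed to be coordinate tuples under Euclidean distance, and every neighborhood is assumed to include the object itself.

```python
import math


def neighbors(points, center, radius):
    """Return all points within `radius` of `center` (the center itself included)."""
    return [q for q in points if math.dist(center, q) <= radius]


def mdef(points, p, r, alpha=0.5):
    """MDEF(p, r, alpha) = (n_hat - n(p, alpha * r)) / n_hat, computed by brute force."""
    sampling = neighbors(points, p, r)                      # sampling neighborhood
    counts = [len(neighbors(points, q, alpha * r)) for q in sampling]
    n_hat = sum(counts) / len(counts)                       # average counting-neighborhood size
    return (n_hat - len(neighbors(points, p, alpha * r))) / n_hat


def r_min(points, p, k=20):
    """Distance from p to its k-th nearest neighbor (p itself excluded)."""
    dists = sorted(math.dist(p, q) for q in points if q != p)
    return dists[k - 1]
```

On a small regular grid with one far-away point, this sketch gives an MDEF near 0 for an interior grid point and a value close to 1 for the far-away point once the sampling radius is large enough to reach the grid. The quadratic cost per query is acceptable only for illustration.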

^2 In all experiments, $\alpha = 0.5$ as in [12]. 
Next we give some examples to better illustrate MDEF. Figure 2 shows a 
dataset which has mainly two groups: a large, sparse cluster and a small, dense 
one, both following a Gaussian distribution. There are also a few isolated points. 
We show MDEF plots for four objects in the dataset. 

— Consider the point in the middle of the large cluster, Nm, (at about x = 70, 
y = 68). The MDEF value is low at all scales: compared with its neighborhood, 
whatever the scale is, the local neighborhood density is always similar 
to the average local density in its sampling neighborhood. So, the object can 
always be regarded as a normal object in the dataset. 

— In contrast, for the other three objects, there exist situations where the 
MDEFs are very large, sometimes even approaching 1. This shows that 
they differ significantly from their neighbors in some scales. The greater the 
MDEF value is, the stronger the degree of “outlier-ness”. 

Even though all three objects in Figure 2 can be regarded as outliers, they 
are still different, in that they exhibit “outlier-ness” at different scales. 

— The MDEF value of the outlier in the small cluster, SC-O, (at about x = 22, 
y = 27), reaches its maximum at radius r ≈ 5, then it starts to decrease 
rapidly until it becomes 0 and remains there for a while (in the range of 
r ≈ 23–45). Then the MDEF value increases again but only to the degree of 0.6. 
The change of MDEF values indicates that the object is extremely abnormal 
compared with objects in the very small local neighborhood (objects in the 
small cluster). 

— On the other hand, the outlier of the large cluster, LC-O, (at about x = 70, 
y = 98), exhibits strong “outlier-ness” in the range from r = 10 to r = 30, 
then becomes more and more ordinary as we look at a larger scale. 

— For the isolated outlier, O-O, (at about x = 47, y = 20), its MDEF value 
stays at 0 up to almost r = 22, indicating that it is an isolated object. Then, 
it immediately displays a high degree of “outlier-ness.” 



4 Proposed Method (OBE) 

4.1 Overview 

OBE detects outliers based on user-provided examples and a user-specified frac- 
tion of objects to be detected as outliers in the dataset. OBE performs outlier 
detection in three stages: feature extraction step, example augmentation step 
and classification step. Figure 3 shows the overall OBE framework. 

4.2 Feature Extraction Step 

The purpose of this step is to map all objects into the MDEF-based feature 
space, where the MDEF plots of objects capturing the degree of “outlier-ness,” 
as well as the scales at which the “outlier-ness” appears, are represented by 
vectors. Let D be the set of objects in the feature space. In this space, each 
object is represented by a vector: $O_i = (m_{i0}, m_{i1}, \ldots, m_{in})$, $O_i \in D$, where 
$m_{ij} = \mathrm{MDEF}(p_i, r_j, \alpha)$, $0 \le j \le n$, $r_0 = \min_k(r_{min,k})$, $r_n = r_{max}$, 
$r_j = \frac{r_n - r_0}{n} j + r_0$.^3 
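
As a rough illustration of this mapping, the sketch below builds the grid of sampling radii and the feature vector of one object. It assumes the radii are evenly spaced between the smallest and largest radius, and it takes the MDEF computation as a caller-supplied function; `radius_grid`, `feature_vector` and `mdef_fn` are hypothetical names, not part of the paper.

```python
def radius_grid(r0, rn, n):
    """Return n + 1 radii r_0, ..., r_n, assuming even (linear) spacing."""
    return [r0 + (rn - r0) * j / n for j in range(n + 1)]


def feature_vector(mdef_fn, p, r_min_p, radii):
    """Map object p to (m_0, ..., m_n).

    mdef_fn(p, r) is any function returning the MDEF of p at radius r.
    Radii below p's own minimum radius r_min_p contribute 0, because the
    sampling neighborhood is still too small to give a reliable estimate.
    """
    return [mdef_fn(p, r) if r >= r_min_p else 0.0 for r in radii]
```

Every object is evaluated on the same grid, so the resulting vectors are directly comparable across objects.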



4.3 Example Augmentation Step 

In the context of outlier detection, outliers are usually few, and the number of 
examples that users could offer is even smaller. If we only learn from the given 
examples, there is too little information to construct an accurate classifier. 
However, example-based outlier detection is practical only if the number of 
required examples is small. OBE resolves this tension by augmenting the 
user-provided examples. 

In particular, the examples are augmented by adding outstanding outliers 
and artificial positive examples, based on the original examples. 

Outstanding Outliers. After all objects are projected into the feature 
space, we can detect outstanding outliers. The set of outstanding outliers is 
defined by $\{O_i \mid \mathrm{max\_M}(O_i) > K, O_i \in D\}$, where $\mathrm{max\_M}(O_i) = \max_j(m_{ij})$ 
and $K$ is a threshold. 

Artificial Examples. The examples are further augmented by creating 
“artificial” data. This is inspired by the fact that an object is sure to be an 
outlier if all of its feature values (i.e., MDEF values) are greater than those of 
the given outlier examples. Figure 4 shows the created artificial data and the 
original example. 

Artificial data are generated in the following way: (1) Take the difference 
between $\mathrm{max\_M}(O_i)$ and the threshold $K$, $\mathrm{Diff\_M}(i) = K - \mathrm{max\_M}(O_i)$. 
(2) Divide the difference, $\mathrm{Diff\_M}(i)$, into $x$ intervals, where $x$ is the number 
of artificial examples generated from an original outlier example plus 1. For 
instance, if the intended augmentation ratio is 200%, two artificial examples are 
generated from each original example. Then we divide $\mathrm{Diff\_M}(i)$ into 3 intervals 
($x = 3$), $\mathrm{Intv\_M}(i) = \mathrm{Diff\_M}(i)/x$. (3) Then, create artificial examples as: 

$$O\_A(i, j) = (m_{i0} + j \cdot \mathrm{Intv\_M}(i),\; m_{i1} + j \cdot \mathrm{Intv\_M}(i),\; \ldots,\; m_{in} + j \cdot \mathrm{Intv\_M}(i))$$ 

for $1 \le j \le x - 1$. Here, $O\_A(i, j)$ is the $j$-th artificial example generated from 
object $O_i$. 

In this way, the “outlier-ness strength” of the user’s examples is amplified, 
in a way consistent with these examples. 

Putting together the original examples, outstanding outliers and artificial exam- 
ples, we get the positive training data. 



4.4 Classification Step 

So far, the (augmented) positive examples, as well as the entire, unlabeled 
dataset are available to us. The next crucial step is finding an efficient and 

^3 More precisely, if $r_j \ge r_{min,i}$, $m_{ij} = \mathrm{MDEF}(p_i, r_j, \alpha)$; otherwise, $m_{ij} = 0$. 
[Figure panel: MDEF Plot] 

Fig. 3. The Framework of OBE. 

Fig. 4. The Artificial and the Original Examples. 



effective algorithm to discover the “hidden” outlier concept that the user has in 
mind. 

We use an SVM (Support Vector Machine) classifier to learn the “outlier- 
ness” of interest to the user and then detect outliers which match this. Tra- 
ditional classifier construction needs both positive and negative training data. 
However, it is too difficult and also a burden for users to provide negative data. 
Most objects fall in this category and it is unreasonable to expect users to ex- 
amine them. 

However, OBE addresses this problem and can learn only from the positive 
examples obtained in the augmentation step and the unlabeled data (i.e., the 
rest of the objects in the dataset). The algorithm shown here uses the marginal 
property of SVMs. In this sense, it bears some general resemblance to PEBL [13], 
which was also proposed for learning from positive and unlabeled data. However, 
in PEBL, the hyperplane for separating positive and negative data is set as close 
as possible to the set of given positive examples. In the context of OBE, the 
positive examples are just examples of outliers, and it is not desirable to set 
the hyperplane as in PEBL. The algorithm here decides the final separating 
hyperplane based on the fraction of outliers to be detected. Another difference 
between OBE and PEBL is that strong negative data are determined taking the 
characteristics of MDEF into consideration. 

The classification step consists of the following five sub-steps. 

Negative training data extraction sub-step. All objects are sorted in 
descending order of max_M. Then, from the objects at the bottom of the list, 
we select a number of (strong) negative training data equal to the number of 
positive training data. Let the set of strong negative training data be NEG. Also, 
let the set of positive training data obtained in the example augmentation step 
be POS. 

Training sub-step. Train an SVM classifier using POS and NEG. 

Testing sub-step. Use the SVM to divide the dataset into the positive set 
P and negative set N. 

Update sub-step. Replace NEG with N, the negative data obtained in the 
testing sub-step. 

Iteration sub-step. Iterate from the training sub-step to the updating 
sub-step until the ratio of the objects in P converges to the fraction specified by 
the user. The objects in the final P are reported to the user as detected outliers. 
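
The five sub-steps can be sketched as a short loop. Two assumptions are made loudly here: the classifier is pluggable, and the default `centroid_trainer` is a hypothetical nearest-centroid stand-in used only to keep the sketch dependency-free; it is not an SVM and not the LIBSVM setup used in the experiments. The `max_iter` guard is also an addition of this sketch.

```python
def select_strong_negatives(features, count):
    """Return the `count` feature vectors with the smallest max_M values."""
    return sorted(features, key=max)[:count]


def centroid_trainer(pos, neg):
    """Hypothetical stand-in for train_SVM: a nearest-centroid rule, not an SVM."""
    def centroid(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda v: sq_dist(v, c_pos) < sq_dist(v, c_neg)


def obe_classify(features, pos, fraction, train=centroid_trainer, max_iter=50):
    """Repeat train / test / update until |P| reaches the target or stops changing."""
    neg = select_strong_negatives(features, len(pos))    # negative extraction
    positives = list(range(len(features)))               # P := D
    for _ in range(max_iter):                            # guard added for this sketch
        previous = positives                             # P' := P
        is_outlier = train(pos, neg)                     # training sub-step
        labels = [is_outlier(v) for v in features]       # testing sub-step
        positives = [i for i, flag in enumerate(labels) if flag]
        neg = [v for v, flag in zip(features, labels) if not flag]  # update sub-step
        if len(positives) <= fraction * len(features):
            break
        if len(positives) == len(previous):
            break
    return positives
```

The sketch returns the indices of the final positive set, following the prose description of the iteration sub-step. Any function with the same `train(pos, neg)` shape that returns a predicate over feature vectors can be substituted for the stand-in.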
Input: 
    Set of outlier examples: E 
    Fraction of outliers: F 
    Dataset: D 
Output: 
    Outliers like examples 
Algorithm: 
    OO := ∅    // Outstanding outliers 
    // Feature extraction step: 
    For each p_i ∈ D 
        For each j (0 ≤ j ≤ n) 
            Compute MDEF value m_ij 
            If m_ij > K 
                Then OO := OO ∪ {p_i} 
    // Example augmentation step: 
    For each example in E 
        Create artificial examples 
    POS := E ∪ OO ∪ artificial examples 
    // Classification step: 
    NEG := strongest negatives 
    P := D 
    Do { 
        P' := P 
        SVM := train_SVM(POS, NEG) 
        (P, N) := SVM.classify(D) 
        NEG := N 
    } while (|P| > F * |D| and |P| ≠ |P'|) 
    return P' 



Fig. 5. The Overall Procedure of OBE 



Table 1. Description of synthetic and real datasets. 

Dataset   Description 
Uniform   A 6000-point group following a uniform distribution. 
Ellipse   A 6000-point ellipse following a Gaussian distribution. 
Mixture   A 5000-point sparse Gaussian cluster, a 2000-point dense Gaussian 
          cluster and 10 randomly scattered outliers. 
NYWomen   Marathon runner data, 2229 women from the NYC marathon: average pace 
          (in minutes per mile) for each stretch (6.2, 6.9, 6.9, and 6.2 miles). 



Figure 5 summarizes the overall procedure of OBE. 

5 Experimental Evaluation 

In this section, we describe our experimental methodology and the results ob- 
tained by applying OBE to both synthetic and real data, which further illustrate 
the intuition and also demonstrate the effectiveness of our method. 

We use three synthetic datasets and one real dataset (see Table 1 for 
descriptions) to evaluate OBE. 

5.1 Experimental Procedure 

Our experimental procedure is as follows: 

1. To simulate interesting outliers, we start by selecting objects which represent 
“outlier-ness” at some scales under some conditions, for instance, 
$\bigwedge_q (min_q, max_q, Cond_q, K_q)$, where $(min_q, max_q, Cond_q, K_q)$ stands for 
the condition that $(m_{ij}\ Cond_q\ K_q)$ for some $j$ such that $min_q \le j \le max_q$, 
where $Cond_q$ could be either “>” or “<”. 
Table 2. Interesting Outliers, Discriminants and the Performance of OBE. OO denotes 
outstanding outliers, IO denotes interesting outliers. Precision (Preci-), recall (Reca-) 
and the number of iterations (Iter-) for convergence in the classification step are used 
to show the performance of OBE. 

              Cases                                               OBE 
Dataset   OO  Label     Description     Condition            IO   Preci-  Reca-  Iter- 
Uniform    0  U-Fringe  Fringe          (0.3, 0.6, >, 0.4)   330  82.76   88.18   8.1 
              U-Corner  Corner          (1, 2, >, 0.5)       274  91.90   97.92   4.1 
Ellipse   15  E-Fringe  Fringe          (5, 30, >, 0.85)     214  90.20   93.55   6.1 
              E-Long    Long Ends       (15, 25, >, 0.8)     140  88.67   92.14   5.4 
                                        (30, 40, >, 0.6) 
              E-Short   Short Ends      (5, 15, >, 0.8)      169  76.46   80.00  10.4 
                                        (35, 40, <, 0.6) 
Mixture   29  M-All     All             (1, 35, >, 0.9)      166  86.32   93.80   4.5 
              M-Large   Large Cluster   (15, 35, >, 0.9)     123  91.52   95.37   4.6 
              M-Small   Small Cluster   (1, 5, >, 0.9)        72  91.30   97.92   5.3 
NYWomen   17  N-FS      Very Fast/Slow  (800, 1400, >, 0.7)   91  81.53   84.95   6.5 
              N-PF      Partly Fast     (300, 500, >, 0.8)   126  73.07   78.81   6.9 
                                        (1400, 1600, <, 0.4) 
              N-SS      Stable Speed    (100, 300, >, 0.8)   121  66.55   70.74   9.2 
                                        (400, 600, <, 0.3) 


2. Then, we “hide” most of these outliers. In particular, we randomly sample 
y% of the outliers to serve as examples that would be picked by a user. 

3. Next, we detect outliers using OBE. 

4. Finally, we compare the detected outliers to the (known) simulated set of 
outliers. More specifically, we evaluate the success of OBE in recovering the 
hidden outlier concept using precision/recall measurements. 

OBE reports as interesting outliers the outstanding ones, as well as those 
returned by the classifier. Table 2 shows all the sets of interesting outliers along 
with the corresponding discriminants used as the underlying outlier concept in 
our experiments. In the table, for instance, the discriminant (1, 35, >, 0.9) 
means that objects are selected as interesting outliers when their MDEF values 
are greater than 0.9 in the range of radii from 1 to 35. The number of the 
outstanding outliers and interesting outliers is also shown in Table 2. We always 
randomly sample 10% {y = 10) of the interesting outliers to serve as user- 
provided examples and “hide” the rest. 
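
A discriminant of this form can be checked mechanically. The sketch below is one possible reading of the condition defined in step 1 of the procedure: each tuple is satisfied if the comparison holds at some sampled radius inside its range, and an object is an interesting outlier if all tuples of the case are satisfied. The representation of an MDEF profile as a dictionary from radius to MDEF value, and the function names, are assumptions of this sketch.

```python
import operator

# Comparison operators allowed in a discriminant tuple.
OPS = {">": operator.gt, "<": operator.lt}


def satisfies(mdef_at, discriminant):
    """True if the comparison holds at some sampled radius within [lo, hi]."""
    lo, hi, cond, k = discriminant
    return any(OPS[cond](m, k) for r, m in mdef_at.items() if lo <= r <= hi)


def is_interesting(mdef_at, discriminants):
    """True if every discriminant tuple of the case is satisfied."""
    return all(satisfies(mdef_at, d) for d in discriminants)
```

For example, a profile with an MDEF of 0.95 at radius 10 satisfies (1, 35, >, 0.9) but not (15, 35, >, 0.9), unless it also exceeds 0.9 somewhere between radii 15 and 35.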

To detect outstanding outliers, we use K = 0.97 for all the synthetic datasets 
and K = 0.99 for the NYWomen dataset. The discovered outstanding outliers of 
the synthetic datasets are shown in Figure 6. Also, during the augmentation step, 
we always generate 5 (x = 6) artificial examples from each original example. 

Fig. 6. Outstanding Outliers in the Synthetic Datasets. 

We use the LIBSVM [14] implementation for our SVM classifier. We extensively 
compared the accuracy of both linear and polynomial SVM kernels and found that 
polynomial kernels perform consistently better. Therefore, in all experiments, 
we use polynomial kernels and the same SVM parameters.^4 As a result, the whole 
process can be done automatically. We report the effectiveness of OBE in 
discovering the “hidden” outliers using precision and recall measurements: 

$$\mathrm{Precision} = \frac{\#\ \text{of correct positive predictions}}{\#\ \text{of positive predictions}} \qquad (2)$$ 

$$\mathrm{Recall} = \frac{\#\ \text{of correct positive predictions}}{\#\ \text{of positive data}} \qquad (3)$$ 
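
For concreteness, both measurements can be computed from the set of detected objects and the set of simulated interesting outliers. This is a minimal sketch with a hypothetical function name; returning 0.0 for an empty set is a convention chosen here, not something the paper specifies.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for predicted vs. actual outlier identifiers."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)              # correct positive predictions
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall
```

If four objects are reported and three of them are among five true outliers, precision is 3/4 and recall is 3/5.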



5.2 Results 

Uniform dataset. Figure 7 shows the outliers detected by OBE. Although one 
might argue that no objects from an (infinite!) uniform distribution should be 
labeled as outliers, the objects at the fringe or corner of the group are clearly 
“exceptional” in some sense. On the top row, we show the interesting outliers, 
original examples and the detected results for case U-Fringe. The bottom row 
shows those for case U-Corner (see Table 2 for a description of the cases). Note 
that the chosen features can capture the notion of both “edge” and “corner” 
and, furthermore, OBE can almost perfectly reconstruct these hidden outlier 
notions! 

Ellipse dataset. We simulate three kinds of interesting outliers for the ellipse 
dataset: (i) the set of fringe outliers whose MDEF values are examined at a wide 
range of scales, (ii) those mainly spread at the long ends of the ellipse which 
display outlier-ness in two ranges of scales (from 15 to 25 and from 30 to 40), 
and (iii) mainly in the short ends, which do not show strong outlier-ness in the 
scales from 35 to 40. The output of OBE is shown in Figure 8. Again, the features 
can capture several different and interesting types of outlying objects and OBE 
again discovers the underlying outlier notion! 

Mixture dataset. We also mimic three categories of interesting outliers: (i) the 
set of outliers scattered along the fringe of both clusters, (ii) those mainly spread 
along the fringe of the large cluster, and (iii) those mainly in the small cluster. 
Due to space constraints, the figure is omitted here. 

^4 For the parameter C (the penalty imposed on training data that fall on the wrong 
side of the decision boundary), we use 1000, i.e., a high penalty for misclassification. 
For the polynomial kernel, we employ a kernel function of (u' * v + 1)^. 
[Figure panel: Detection Results] 

Fig. 7. Detection Results on the Uniform Dataset. Top row: case U-Fringe, bottom 
row: case U-Corner (see Table 2 for a description of each case). 

[Figure panels: Interesting Outliers, Original Examples, Detection Results] 

Fig. 8. Detection Results on the Ellipse Dataset. From top to bottom, in turn: case 
E-Fringe, case E-Long, case E-Short (see Table 2 for a description of each case). 

NYWomen dataset. In the real dataset, we mimic three kinds of intentions 
for outliers: The first group (case N-FS) is the set of consistently fast or slow 
runners (i.e., the fastest 7 and almost all of the 70 very slow ones). The second 
group of outlying runners (case N-PF) are those who are at least partly fast. 
In this group, we discover both the fastest 23 runners and those runners who 
were abnormally fast in one or two parts of the four stretches, although they 
rank middle or last in the whole race. For example, one of them took 47 minutes 
for the first 6.2 miles, while 91 minutes for the last 6.2 miles. The third set of 
interesting outliers (case N-SS) is those who run with almost constant speed 
and rank middle in the whole race. They are very difficult to perceive, but they 
Fig. 9. Detection Results on the NYWomen Dataset. From top to bottom in turn: 
Case N-FS, Case N-PF, Case N-SS (see Table 2 for a description of each case). Only the 
first and fourth dimensions are used for the plots, although the NYWomen Dataset is 
four-dimensional. 



certainly exhibit “outlier-ness” when we examine them at a small scale. Because 
of space limits, we only show the result plots in the first and fourth dimensions 
(see Figure 9). 

For all datasets, Table 2 shows the precision and recall measurements for 
OBE, using polynomial kernels (as mentioned, polynomial kernels always per- 
formed better than linear kernels in our experiments). It also shows the number 
of iterations needed to converge in the learning step. In Table 2, all the mea- 
surements are averages of ten trials. In almost all cases, OBE detects interesting 
outliers with both precision and recall reaching 80-90%. In the worst case (case 
N-SS of NYWomen ), it still achieves 66% precision and 70% recall. The number 
of iterations is always small (between about 4 and 10 on average; see Table 2). 

6 Conclusion 

Detecting outliers is an important, but tricky problem, since the exact notion 
of an outlier often depends on the user and/or the dataset. We propose to solve 
this problem with a completely novel approach, namely, by bringing the user into 
the loop, and allowing him or her to give us some example records that he or 
she considers as outliers. 

The contributions of this paper are the following: 

— We propose OBE, which, to the best of our knowledge, is the first method 
to provide a solution to this problem. 
— We built a system and described our design decisions. Although OBE appears 
simple to the user (“click on a few outlier-looking records”), there 
are many technical challenges under the hood. We showed how to approach 
them, and specifically, how to extract suitable feature vectors out of our 
data objects, and how to quickly train a classifier to learn from the (few) 
examples that the user provides. 

— We evaluated OBE on both real and synthetic data, with several small sets 
of user examples. Our experiments demonstrate that OBE can successfully 
incorporate these examples in the discovery process and detect outliers with 
“outlier-ness” characteristics very similar to the given examples. 
Abstract. In many real world applications, systematic analysis of rare 
events, such as credit card frauds and adverse drug reactions, is very 
important. Their low occurrence rate in large databases often makes it 
difficult to identify the risk factors from straightforward application of 
associations and sequential pattern discovery. In this paper we introduce 
a heuristic to guide the search for interesting patterns associated 
with rare events from large temporal event sequences. Our approach 
combines association and sequential pattern discovery with a measure 
of risk borrowed from epidemiology to assess the interestingness of 
the discovered patterns. In the experiments, we successfully identify 
a known drug and several new drug combinations with high risk of 
adverse reactions. The approach is also applicable to other applications 
where rare events are of primary interest. 



1 Introduction 

The present work is motivated by the specific domain of temporal data mining 
in health care, especially adverse drug reactions. Adverse drug reactions occur 
infrequently but may lead to serious or life threatening conditions requiring hos- 
pitalisation. Thus, systematic monitoring of adverse drug reactions is of financial 
and social importance. The availability of a population-based prescribing data 
set, such as the Pharmaceutical Benefits Scheme (PBS) data in Australia, linked 
to hospital admissions data, provides an opportunity to detect common and rare 
adverse reactions at a much earlier stage. The problem domain has the following 
characteristics: (1) Primary interest lies in rare events amongst large datasets; 
(2) Factors leading to rare adverse drug reactions include temporal drug expo- 
sure; (3) Rare events are associated with a small proportion of patients yet all 
data for all patients are required to assess the risk. 

For adverse drug reactions, we usually have little prior knowledge of what 
drug or drug combinations might lead to unexpected outcomes. Our aim is to 

* The authors acknowledge the valuable comments from their colleagues, including C. 
Carter, R. Baxter, R. Sparks, and C. Kelman, as well as the anonymous reviewers. 
The authors also acknowledge the Commonwealth Department of Health and Ageing, 
and the Queensland Department of Health for providing data for this research. 
discover patterns associated with rare events that are then further assessed for 
their possible relationship with adverse outcomes. Different from some previous 
work on mining interesting patterns for group difference [2,3] and health data [8, 
4], we propose an approach which extends association and sequential pattern 
discovery with a heuristic measure motivated from epidemiology. We discover 
patterns that have high local support but generally low overall support and 
assess their significance using an estimate of a measure of risk ratio. The paper 
is organised as follows. Formal definitions and our methods are presented in 
Section 2. Section 3 reports on some encouraging results. The conclusion and 
discussion are given in Section 4. 

2 Mining Temporal Association for Rare Events 

Consider a collection of entities $C_i$ ($i = 1, 2, \ldots$) $\in E$, for example, 
credit card holders of a bank or patients in hospital. The activities of each 
entity are recorded as an event sequence 
$S_i = ((e_{i1}, t_{i1}), (e_{i2}, t_{i2}), \ldots, (e_{in_i}, t_{in_i}))$, where $n_i$ is the number of 
events for the $i$th entity. For each event $(e_{ij}, t_{ij})$, $e_{ij}$ indicates an event 
type and the timestamp $t_{ij}$ indicates the time of occurrence of the event. For 
example, the 
following sequence describes a set of medical services received by a patient: 

$\langle$(G03CA, 1), (J01DA, 7), (C08CA, 10), (C09AA, 10), (Angioedema, 30)$\rangle$. 

On day 1 the patient was dispensed the drug estrogen, whose ATC code is 
G03CA. They then took cephalosporins and related substances (J01DA) on the 
7th day, and dihydropyridine derivatives (C08CA) and an ACE inhibitor (C09AA) 
on the 10th day. Twenty days later they were hospitalised due to angioedema. 
We refer to this particular event of interest as the target event, which can be 
either within the studied event sequence of an entity or just an external event 
not included in the event sequence but associated with the entity. 

Using the target event as a classification criterion, we partition the entities 
into two subsets. The first one, denoted by $T \subset E$, contains entities having at 
least one target event. The second one, $\bar{T} \subset E$, consists of all remaining entities. 
Note that the time spans can be quite long in any particular sequence. Quite 
often, only events occurring within a particular lead up period, prior to the target 
event, are relevant. We first define a time window and its associated segments 
as follows. 

Definition 1. $[t_s, t_e]$ is a time window that starts at time $t_s$ and ends at time 
$t_e$, where $w = t_e - t_s$ is constant, and usually specified by a domain expert. 

Definition 2. $\langle (e_{ip}, t_{ip}), (e_{i,p+1}, t_{i,p+1}), \ldots, (e_{iq}, t_{iq}) \rangle$ is a windowed 
segment of sequence $s_i$ with time window $[t_s, t_e]$ if 
$t_s \le t_{ip} \le t_{i,p+1} \le \cdots \le t_{iq} \le t_e$, $t_{i,p-1} < t_s$ and $t_{i,q+1} > t_e$. 

Definition 3. For any target entity $c_i \in T$, a target segment of $s_i$ is a 
windowed segment of $s_i$ with time window $[t_s, t_e]$, where $t_e$ indicates the first 
occurrence time of the target event. 

Definition 4. For any entity $c_i \in \bar{T}$, a virtual target segment of $s_i$ is a 
windowed segment of $s_i$ with time window $[t_s, t_e]$. 



There may be more than one target event associated with an entity in $T$. For 
simplicity, only the first target event is considered. Medical advice suggested 
that only events occurring within some time window prior to the target event 
might be considered in this exploration. For example, drug usage within six 
months prior to the adverse reaction is of main interest in our application. Given 
the fixed window length $w$, we have a list of virtual target segments for entity 
$c_i$ with different starting timestamps $t_{i1}, t_{i2}, \ldots, t_{ik}$, where $t_{ik}$ is the first 
element in $\{t_{i1}, t_{i2}, \ldots, t_{in_i}\}$ such that $t_{ik} > t_{in_i} - w$. We denote all these $k$ 
virtual target segments as $L(i, w)$. Also, we can prove that any non-empty virtual 
target segment with $t_e \in [\ldots]$ must be in $L(i, w)$. 
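
The segment definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the closed window $[t_s, t_e]$, and the toy `control` sequence are assumptions made for the example, and the sequence is assumed to be sorted by time.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (event type, timestamp in days)


def windowed_segment(seq: List[Event], ts: int, te: int) -> List[Event]:
    """Events of a time-ordered sequence whose timestamps fall in [ts, te]."""
    return [(e, t) for e, t in seq if ts <= t <= te]


def target_segment(seq: List[Event], target: str, w: int) -> List[Event]:
    """Window of length w ending at the first occurrence of the target event."""
    te = min(t for e, t in seq if e == target)
    return windowed_segment(seq, te - w, te)


def virtual_target_segments(seq: List[Event], w: int) -> List[List[Event]]:
    """Windows of length w started at each distinct timestamp, stopping at the
    first start time that is later than (last timestamp - w)."""
    if not seq:
        return []
    times = sorted({t for _, t in seq})
    last = times[-1]
    segments = []
    for ts in times:
        segments.append(windowed_segment(seq, ts, ts + w))
        if ts > last - w:
            break
    return segments


# The patient sequence from the text, plus a made-up non-target sequence.
patient = [("G03CA", 1), ("J01DA", 7), ("C08CA", 10), ("C09AA", 10),
           ("Angioedema", 30)]
control = [("A", 1), ("B", 7), ("C", 10), ("D", 30)]

print(target_segment(patient, "Angioedema", 25))
print(virtual_target_segments(control, 10))
```

Note that in this sketch the target segment still contains the target event itself; a pattern miner would normally drop it before searching for drug patterns.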

We introduce a risk ratio, as often used in epidemiological studies, to measure 
the association between a factor and a disease, i.e., the ratio of the risk of 
being disease positive for those with the factor to the risk for those without it 
[1, p672]. We use the following estimate of a risk ratio for a candidate pattern $p$ 
occurring within a fixed sized time window: 



$$RR(p, w) = \frac{|T|\, s_T(p)}{|T|\, s_T(p) + |\bar{T}|\, s_{\bar{T}}(p)} \Bigg/ \frac{|T|\,(1 - s_T(p))}{|T|\,(1 - s_T(p)) + |\bar{T}|\,(1 - s_{\bar{T}}(p))} \qquad (1)$$



where $s_T(p)$ is the support in $T$ of pattern $p$, defined as the proportion of 
entities in $T$ having $p$ in their target segments, and $s_{\bar{T}}(p)$ is the support in $\bar{T}$ 
of $p$, the proportion of entities in $\bar{T}$ containing $p$ in their virtual target segments, 
i.e., $p$ is contained in any element of $L(i, w)$. A risk ratio of 1 (i.e., $RR(p, w) = 1$) 
implies that there is equal risk of the disease with or without pattern $p$ within 
a fixed sized time window. When $RR(p, w) > 1$, there is a greater risk of the 
disease in the exposed group. 
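
As a rough illustration of the estimate, the sketch below computes the risk among entities exposed to a pattern and divides it by the risk among unexposed entities, using only the two cohort sizes and the two supports. The function name and the numbers in the example are hypothetical; they are not taken from the experiments.

```python
def estimated_risk_ratio(n_target: int, n_other: int,
                         s_target: float, s_other: float) -> float:
    """Risk ratio estimate from cohort sizes and pattern supports.

    n_target, n_other: sizes of the target and non-target sets
    s_target, s_other: support of the pattern in each set (0 < s < 1)
    """
    exposed_cases = n_target * s_target
    exposed_noncases = n_other * s_other
    unexposed_cases = n_target * (1 - s_target)
    unexposed_noncases = n_other * (1 - s_other)

    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 100 target entities, 900 others.
rr = estimated_risk_ratio(100, 900, s_target=0.5, s_other=0.1)
print(round(rr, 3))
```

When the two supports are equal the function returns 1, matching the interpretation given above. Supports of exactly 0 or 1 in both sets make a denominator vanish, so a practical implementation would need to guard against that case.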

We employ both the support in $T$ and the estimated risk ratio for identifying 
interesting patterns. The framework for mining interesting patterns associated 
with rare events includes the following steps. Firstly, we extract two datasets of 
entities in the problem domain. Each entity records demographic data and an 
event sequence. The first dataset contains all entities with at least one target 
event. The second dataset contains the other entities. Secondly, we partition the 
entities of the two datasets into sub-populations according to their demographics. 
Thirdly, we discover candidate association [6] and sequential patterns [5] 
in $T$ of each sub-population. The events within a fixed sized time window prior 
to the target event are used for pattern discovery. Fourthly, we explore the 
corresponding $\bar{T}$ of each sub-population to identify patterns mined in the above step. 
Finally, estimated risk ratios of the candidate patterns for each sub-population 
are calculated according to Equation 1. 



3 Experimental Results 

The Queensland Linked Data Set [7] links hospital admissions data from Queens- 
land Health with the pharmaceutical prescription data from Commonwealth De- 
partment of Health and Ageing, providing a de-identified dataset for analysis. 

Table 1. Estimated risk ratio of sample discovered association patterns. $|T|/|\bar{T}|$ for 
these cohorts are 55/104257 (Male 20-59), 76/194789 (Female 20-59), and 73/128586 
(Female 60+) 

Gender  Age    Pattern      support %  chi-square  RR 
Male    20-59  C09AA N06AA  9.0        8.896       3.684 
Male    20-59  N06AA H02AB  9.0        5.613       2.888 
Female  20-59  C09AA G03CA  9.2        8.991       3.090 
Female  60+    C09AA G03CA  24.6       19.45       3.112 
Female  60+    C09AA C08CA  26.0       4.435       1.741 


The record for each patient includes demographic variables and a sequence of 
PBS events for a five-year period. Two datasets are extracted. One contains all 
299 patients with hospital admissions due to angioedema, i.e., the target event. The 
other contains 683,059 patients who have no angioedema hospitalisations. 

It should be noted that the studied population consists of hospital patients 
rather than the whole Queensland population. We make the assumption that 
prior to the period of study these patients were not hospitalised for angioedema. 

Tables 1 and 2 list experimental results for particular age/gender cohorts. 
The fourth column lists the support for the pattern in the target dataset. The 
other columns show the chi-square value and the estimated risk ratio, respectively. 
Here we use the chi-square value, which is calculated together with the 
estimated risk ratio, as a threshold to constrain the resulting patterns. Thus, for 
males aged 20-59 the drugs C09AA and N06AA within six months are over 
three times more likely to be associated with angioedema patients than with 
the non-angioedema patients. The following are some of the interesting sequential 
patterns. Usage of estrogen (G03CA) followed by an ACE inhibitor (C09AA), 
which is a known drug possibly resulting in angioedema, within six months is 
associated with an estimated risk ratio of 2.636 for angioedema in females aged 
60+. The sequence consisting of dihydropyridine derivatives (C08CA) and ACE 
inhibitors (C09AA) within six months is generally associated with a high estimated 
risk ratio of angioedema for females aged 60+. Interestingly, four of these 
sequences begin with the pair of C08CA and C09AA, which means the two drugs 
are supplied to patients on the same day. These proposed hypotheses then form 
the basis for further statistical study and validation. 



Table 2. Estimated risk ratio of sequential patterns. $|T|/|\bar{T}|$ for these cohorts are 
76/194789 (Female 20-59), and 73/128586 (Female 60+). 

Gender  Age    Pattern                                    support %  chi-square  RR 
Female  20-59  C09AA -1 C09AA                             15.7       7.210       2.273 
Female  60+    G03CA -1 C09AA                             20.5       12.11       2.636 
Female  60+    C08CA C09AA -1 C08CA -1 C09AA -1 C08CA     17.8       9.207       2.453 
Female  60+    C08CA C09AA -1 C09AA -1 C08CA -1 C08CA     17.8       9.053       2.436 
Female  60+    C08CA -1 C09AA -1 C09AA -1 C08CA           20.5       9.395       2.365 
Female  60+    C09AA -1 C08CA -1 C09AA -1 C08CA -1 C09AA  19.1       8.751       2.346 
Female  60+    C08CA C09AA -1 C09AA -1 C09AA              19.1       8.678       2.339 
Female  60+    C09AA -1 G03CA                             17.8       7.687       2.280 
Female  60+    C08CA C09AA -1 C08CA -1 C09AA              17.8       7.144       2.217 
Female  60+    C09AA -1 C08CA -1 C08CA -1 C09AA -1 C09AA  17.8       6.491       2.139 

4 Conclusion and Discussion 

We have presented a temporal sequence mining framework for rare events and 
successfully identified interesting associations and sequential patterns of drug 
exposure which lead to a high risk of certain severe adverse reactions. Our intent 
is to generate hypotheses identifying potentially interesting patterns, while we 
realise that further validation and examination are needed. We also note 
that matching of patients has not included matching for time. Case matching 
for time is proposed as a refinement that could reduce the potential 
for false positives. In estimating risk ratios, it is not 
clear how to calculate confidence intervals in this case. Confidence intervals are 
important in identifying the degree of uncertainty in the estimates, and further 
work is required to investigate this deficiency. Besides adverse drug reactions, our 
approach may also be applied to a wide range of temporal data mining domains 
where rare events are of primary interest. 

Abstract. This paper presents a method to summarize massive space- 
craft telemetry data by extracting significant event and change patterns 
in the low-level time-series data. This method first transforms the numer- 
ical time-series into a symbol sequence by a clustering technique using 
DTW distance measure, then detects event patterns and change points 
in the sequence. We demonstrate that our method can successfully sum- 
marize the large telemetry data of an actual artificial satellite, and help 
human operators to understand the overall system behavior. 



1 Introduction 

In recent years, many data mining techniques for time-series data, such as 
similar pattern search [1], [2], pattern clustering [3], [4], [5], event detection [6], [7], [8], 
change-point detection [9], [10], [11], and temporal association rule mining [12], have 
been studied actively. These techniques have been successfully applied to various 
domains dealing with vast time-series data such as finance, medicine, biology, 
and robotics. Meanwhile, the telemetry data of spacecraft or artificial satellites 
is also a huge time-series data set, usually containing thousands of sensor outputs 
from various system components. Though it is known that the telemetry data 
often contains some symptoms prior to fatal system failures, the limit-checking 
technique which is ordinarily used in most space systems often fails to detect 
them. 

The purpose of this paper is to propose a data summarization method which 
helps human experts to find the anomaly symptoms by extracting important 
temporal patterns from the telemetry data. The summarization process con- 
sists of symbolization of the originally numerical time-series and detection of 
event patterns and change-points. This data-driven approach to the fault 
detection problem is expected to overcome some limitations of other sophisticated 
approaches such as expert systems and model-based simulations, which require 
costly a priori expert knowledge. We also show some results of applying the 
method to a telemetry data set of an actual artificial satellite ETS-VII (Engi- 
neering Test Satellite VII) of NASDA (National Space Development Agency of 
Japan). 

Fig. 1. Examples of event and change patterns in time-series data (panel (b): mode change) 



2 Proposed Method 

2.1 Basic Idea 

The purpose of our method is to make a summary of the telemetry (housekeeping, HK) 
data automatically by preserving only important information while discarding the 
rest. This helps the operators to understand the health status of the systems, 
and hence increases the chance to find subtle symptoms of anomalies that could 
not be detected by the conventional techniques. 

A non-trivial problem here is that we need to decide beforehand what counts as 
important information in the data. For the HK data of spacecraft, we judged 
that the following two kinds of information are especially important, based on 
the investigation of past failure cases and interviews with experts. 

1. Immediate events: patterns that are distinct from other neighboring parts 

2. Mode changes: points where the characteristics of the time-series change 

Fig. 1 shows examples of the event and change patterns. They are considered to 
carry important information in that they correspond to actual events 
in the system such as “engine thrustings”, “changes of attitude control mode”, 
and so on. In the remainder of this section, we describe the ways of detecting 
the immediate events and mode changes from the data. 

2.2 Detection of Immediate Events 

The symbolization of the original HK data and detection of the immediate events 
are described as follows. First, each time-series in the HK data is divided into a 
set of subsequences with a fixed length. Then all the subsequences are grouped 
into clusters based on the DTW (Dynamic Time Warping) distance measure. 
Finally, each cluster is assigned a unique symbol, and the subsequences contained 
in “small” clusters are detected as event patterns. 
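
The three steps above (fixed-length splitting, DTW-based clustering, flagging small clusters) can be sketched as follows. The text does not say which clustering algorithm is used, so this example swaps in a simple leader clustering with a distance threshold purely for illustration; `threshold`, `max_size` and all function names are assumptions, not part of the described system.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def split_fixed(series, length):
    """Non-overlapping subsequences of a fixed length."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, length)]


def leader_cluster(subseqs, threshold):
    """Assign each subsequence to the first leader within `threshold`
    (illustrative stand-in for the unspecified clustering step)."""
    leaders, labels = [], []
    for s in subseqs:
        for k, leader in enumerate(leaders):
            if dtw_distance(s, leader) <= threshold:
                labels.append(k)
                break
        else:
            leaders.append(s)
            labels.append(len(leaders) - 1)
    return labels


def event_indices(labels, max_size):
    """Indices of subsequences that fall in clusters of at most max_size."""
    counts = Counter(labels)
    return [i for i, lab in enumerate(labels) if counts[lab] <= max_size]


series = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0]
labels = leader_cluster(split_fixed(series, 4), threshold=1.0)
print(labels, event_indices(labels, max_size=1))
```

The cluster labels play the role of the symbols in the symbolized sequence, and the flagged indices are the candidate immediate events.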

Selecting the number of clusters is a common problem for all clustering meth- 
ods. In the current implementation, although the system mostly recommends a 
proper value based on the MDL (Minimum Description Length) criterion, it is 
sometimes necessary for human operators to adjust the parameter in order to 
obtain a better result. 

2.3 Detection of Mode Changes 

In the proposed method, detection of the mode change points in the HK data 
is realized by finding an optimum segmentation of the symbolized time-series. 
Suppose a sequence of symbols $[s_1, \ldots, s_t, \ldots, s_n]$ is divided into $M$ segments. 
Then we model each segment by a 0-order Markov model, and evaluate the 
goodness of this segmentation by the sum of modelling losses for the segments. 
We search for the best segmentation that minimizes the sum of modelling losses, 
and define the borders of segments as the mode change points. 
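
One way to realize such a search is dynamic programming over segment borders. The sketch below assumes that the modelling loss of a segment is its negative log-likelihood under a 0-order Markov model fitted to that segment (i.e., independent symbols with empirical frequencies); the text does not define the loss, so this choice and the function names are assumptions.

```python
import math
from collections import Counter


def segment_loss(symbols):
    """Negative log-likelihood of a segment under its own symbol frequencies."""
    n = len(symbols)
    return -sum(c * math.log(c / n) for c in Counter(symbols).values())


def best_segmentation(symbols, m):
    """Split `symbols` into m contiguous segments minimizing the total loss.

    Returns (borders, loss), where borders are the start indices of
    segments 2..m. Requires len(symbols) >= m.
    """
    n = len(symbols)
    inf = float("inf")
    # best[k][j]: minimal loss of splitting symbols[:j] into k segments
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cand = best[k - 1][i] + segment_loss(symbols[i:j])
                if cand < best[k][j]:
                    best[k][j], back[k][j] = cand, i
    # Walk the back-pointers to recover the segment borders.
    borders, j = [], n
    for k in range(m, 1, -1):
        j = back[k][j]
        borders.append(j)
    return sorted(borders), best[m][n]


borders, loss = best_segmentation(list("aaaabbbb"), 2)
print(borders, loss)
```

With this loss, adding segments never increases the total, which is consistent with the need to fix or tune $M$ separately. The triple loop recomputes segment losses and is only meant for short sequences.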

This change-detection process also has an open problem of how to decide the 
number of segments M, which is similar to the decision problem of number of 
clusters in the symbolization process. In the current implementation, the subtle 
adjustment is up to the operators. 

Fig. 2. (Example 1) Co-occurrences of events and changes 

Fig. 3. (Example 2) Association among events and changes of D60050, D61040, and 
D61050 



3 Case Study 

We implemented the methods described above and applied them to four years of HK data 
from ETS-VII. In this case study, we selected 6 time-series relating to 
the AOCS (Attitude & Orbit Control Subsystem). Their brief descriptions are 
given in Table 1. 

Table 1. List of time-series in ETS-VII's HK data chosen for this case study 

ID      Explanation 
D60040  Drive signal of AOCS reaction wheel (Roll) 
D60050  Drive signal of AOCS reaction wheel (Pitch) 
D60060  Drive signal of AOCS reaction wheel (Yaw) 
D61030  Incremental angle of IRU (Roll) 
D61040  Incremental angle of IRU (Pitch) 
D61050  Incremental angle of IRU (Yaw) 



Fig. 4. (Example 3) Transition of relationship among event frequencies 

Fig. 5. (Example 4) Detection of an event or distinctive pattern 



Fig. 2 shows an example of the occurrence pattern of the events and changes 
in the 6 time-series for one day. We can easily notice the associations among 
the series. Fig. 3 also gives a summary of events and changes for another day. 
In this figure, we can see a characteristic association pattern among the series. 
That is to say, D61030, D61040 and D61050 become suddenly active right after 
an event occurs in D60050 (access period 2588), and then become inactive again, 
responding to the simultaneous events in D60040, D60050 and D60060 (period 
2593). Fig. 4 summarizes the four years of data in a more abstract way. It shows 
the transition of the daily frequencies of the events and changes in each series. 
We can browse the global trend of the system activities and associations among 
the series. Fig. 5 shows an example of anomalous patterns detected in D61050 
by the proposed method. 

4 Conclusion 

In this paper, we presented a method to summarize spacecraft telemetry data and 
to visualize the most important information in it for monitoring the health status of 
the spacecraft systems. It focuses on two kinds of temporal patterns - “event” 
and “change” in the time-series, and extracts them by combining techniques of 
pattern clustering and change-point detection. 

Abstract. This paper proposes an extended negative selection algorithm for 
anomaly detection. Unlike previously proposed negative selection algorithms 
which do not make use of non-self data, the extended negative selection algorithm 
first acquires prior knowledge about the characteristics of the problem space from 
the historical sample data by using machine learning techniques. Such data 
consists of both self data and non-self data. The acquired prior knowledge is 
represented in the form of production rules and thus viewed as common schemata 
which characterise the two subspaces: self-subspace and non-self-subspace, and 
provide important information to the generation of detection rules. One advantage 
of our approach is that it does not rely on the structured representation of the data 
and can be applied to general anomaly detection. We evaluate the effectiveness of 
our approach through experiments with the public data set iris and the KDD’99 
published data set. 



1 Introduction 



The natural immune system has attracted great research interest from scientists because of 
its powerful information processing capability. It protects biological bodies from 
disease-causing pathogens by pattern recognition, reinforcement learning and 
adaptive response. The idea of applying computational immunology to the 
problem of computer/network security derives from the observation that viruses and 
network intrusions are analogous to pathogens in human bodies. The negative selection 
algorithm, proposed by Stephanie Forrest and her research group in 1994, has been 
considered a highly feasible technique for anomaly detection and has been 
successfully applied to computer virus/network intrusion detection [4], tool breakage 
detection [5], time-series anomaly detection [6], Web document classification [7], etc. 
The most striking features of the algorithm are that it does not require any prior 
knowledge of anomalies and can implement distributed anomaly detection in a 
network environment. For a given data set S to be protected, a set of detectors R is 
generated in a way that each detector d (a binary string) does not match any string s in 
S. The negative selection algorithm works in a straightforward way to generate the 
repertoire R. It randomly generates a string d and matches d against each string in S. 
If d does not match any string in S, then d is stored in R. This process is repeated until we 
have enough detectors in R to ensure the desired protection. This generate-and-test 
method requires sampling quite a large number of candidate detectors, and the 
computational complexity is exponential in the size of the self data set S [8]. 

The original negative selection algorithm can be simply described as follows: 

• Define self S as a collection of strings of length l over a finite 
alphabet, a collection that needs to be protected; 

• Generate a set R of detectors, each of which fails to match any string in S; 

• Monitor S for changes by continually matching the detectors in R against S. 
If any detector ever matches, then a change is known to have occurred, as the 
detectors are designed not to match any of the original strings in S. 
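
The generate-and-test procedure above can be sketched as follows for binary strings. The sketch assumes r-contiguous matching (two strings match if they agree in at least r consecutive positions) as the matching rule; the function names, the `max_tries` cap and the toy self set are illustrative assumptions.

```python
import random


def r_contiguous_match(a: str, b: str, r: int) -> bool:
    """True if a and b agree in at least r consecutive positions."""
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False


def generate_detectors(self_set, length, r, n_detectors, rng, max_tries=100000):
    """Randomly generate strings and keep those that match no self string."""
    detectors = []
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        d = "".join(rng.choice("01") for _ in range(length))
        if not any(r_contiguous_match(d, s, r) for s in self_set):
            detectors.append(d)
    return detectors


def is_anomalous(x: str, detectors, r: int) -> bool:
    """Monitoring step: a change is flagged if any detector matches x."""
    return any(r_contiguous_match(d, x, r) for d in detectors)


self_set = ["00000000", "11111111"]
detectors = generate_detectors(self_set, length=8, r=4, n_detectors=5,
                               rng=random.Random(0))
print(detectors)
print([is_anomalous(s, detectors, 4) for s in self_set])
```

By construction no self string is ever flagged; whether a given non-self string is detected depends on how well the random detectors cover the non-self space, which is the coverage problem discussed in the text.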

Variations of the negative selection algorithm have also been investigated, mainly 
focusing on the representation scheme, detector generation algorithm and matching rule. 
Representation schemes that have been explored include hyper-rectangle-rule detectors [9], 
fuzzy-rule detectors [10], and hyper-sphere-rule detectors [11]. Their corresponding 
detector generation algorithms are negative selection with detection rules, negative 
selection with fuzzy rules and randomised real-valued negative selection. 
Genetic algorithms or other approaches are employed to generate detection rules. 
Matching rules in negative selection algorithms were analysed and compared in [14], 
which discussed r-contiguous matching, r-chunk matching, Hamming distance 
matching, and its variation, Rogers and Tanimoto matching. 

The drawbacks of previous work on negative selection algorithms can be broadly 
summarised as follows: 

• Non-self data in the historical data sample are completely ignored and thus make 
no contribution to the generation of the detector set. 

• The distribution of self data in the problem space is not investigated, 
which leads to a large number of detectors being produced in order to 
implement effective protection. 

• The distribution of randomly generated detectors in the non-self 
space is not analysed. In practice, most of the detectors do not take part in any 
detecting task at all. 

In this paper, we propose an extended negative selection algorithm which 
combines computational immunology and machine learning techniques. It first learns 
the prior knowledge about both self and non-self by applying machine learning 
approaches to historical sample data which contains partial non-self data. The prior 
knowledge is represented in the form of production rules and can be viewed as 
common schemata which characterise the two subspaces: self-subspace and non-self- 
subspace. These schemata provide much information for guiding the generation of 
detection rules which are used to monitor a system for anomaly detection. 

The rest of the paper is organized as follows. Section 2 presents the problem 
description. Section 3 introduces the extended negative selection algorithm. Section 4 
demonstrates some experimental results, and Section 5 presents the conclusions. 

2 Problem Formulation 



Anomaly detection, from the viewpoint of immunology, aims at identifying whether 
or not a new element in a problem space is non-self when self is known. 

Suppose that the problem space $S$ is an $n$-dimensional vector space with features 
$x_1, x_2, \ldots, x_n$, where $x_j$ is either a symbolic feature or a numeric feature. An 
element $e \in S$, labelled $* \in \{\text{self}, \text{non-self}\}$, is a feature vector. Those elements 
that belong to self form a subspace denoted as $SS$, and the non-self subspace $NS$ is 
defined as the complementary space of $SS$. The two subspaces are described as follows: 

Self subspace: $SS = \{ e_i \mid * = \text{self},\ i = 1, 2, \ldots, n \}$ 

Non-self subspace: $NS = \{ e_j \mid * = \text{non-self},\ j = 1, 2, \ldots, m,\ m < n \}$ 

where $SS \cup NS = S$ and $SS \cap NS = \emptyset$. 

Given a data set $S$ for detecting anomaly, the characteristic function $\chi$ for 
differentiating self and non-self is defined as follows: 

$$\chi(e) = \begin{cases} 1, & \text{if } e \in SS \\ 0, & \text{if } e \in NS \end{cases}$$ 

A detector is usually represented as a detection rule, the structure of which is 
described as: 

$$x_1 \in [val_1^{l}, val_1^{u}] \wedge \cdots \wedge x_i = d_i \wedge \cdots \wedge x_n \in [val_n^{l}, val_n^{u}] \rightarrow \text{abnormal}$$ 

where $x_i \in [val_i^{l}, val_i^{u}]$ means that $x_i$ is a real-valued feature and $x_j = d_j$ indicates that 
feature $x_j$ is symbolic. A detection rule defines a hypercube in the complementary 
space of self space. 
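
To make the rule structure concrete, the sketch below represents a detection rule as a set of interval constraints on real-valued features plus equality constraints on symbolic features, and tests whether an element falls inside the resulting hypercube. The class name, the closed intervals and the feature names `duration` and `protocol` are hypothetical choices for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DetectionRule:
    # Real-valued features: feature name -> (lower bound, upper bound)
    intervals: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    # Symbolic features: feature name -> required value
    symbols: Dict[str, str] = field(default_factory=dict)

    def matches(self, element: dict) -> bool:
        """True if the element satisfies every condition of the rule."""
        for name, (lo, hi) in self.intervals.items():
            if not lo <= element[name] <= hi:
                return False
        for name, value in self.symbols.items():
            if element[name] != value:
                return False
        return True


# Hypothetical rule: duration in [0.0, 2.5] and protocol = icmp -> abnormal
rule = DetectionRule(intervals={"duration": (0.0, 2.5)},
                     symbols={"protocol": "icmp"})
print(rule.matches({"duration": 1.0, "protocol": "icmp"}))
print(rule.matches({"duration": 3.0, "protocol": "icmp"}))
```

Features that do not appear in a rule are unconstrained, so a rule with fewer conditions covers a larger region of the problem space.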



3 The Proposed Approach 

3.1 Learning Schemata from Historical Data 

Finding common schemata is also inspired from the natural immune system [19]. 
Bacteria are inherently different from human cells, and many bacteria have cell walls 
made from polymers that do not occur in humans. The immune system recognizes 
bacteria partially on the basis of the existence of these unusual molecules. Common 
schemata represent generic properties of the antigen population and can be obtained 
by some intelligent computational methods. In GA, a schema is viewed as a template 
specifying groups of chromosomes with a common characteristic, or a description of 
hyperplanes through genome space and, generally represented as a string of symbols 
such as “##1101##”. The character # means “don’t care”. In this paper, we 
use the well-known classification algorithm C4.5 to learn common schemata, which 
are represented as conjunctions of attribute-value pairs as follows: 

$$x_1 \in [val_1^{l}, val_1^{u}] \wedge \cdots \wedge x_i = d_i \wedge \cdots \wedge x_k \in [val_k^{l}, val_k^{u}]$$ 

Essentially a schema is also a hypercube in the state space (see Figure 1). 





Fig. 1. Self and non-self space characterised by schemata 

In this paper, we exploit the classification algorithm C4.5 to learn the schemata from 
historical data. Once a decision tree is built, a path from the root to any leaf forms a 
production rule, the premise of which can be viewed as a schema. Generally, a 
shorter schema covers more examples than a longer schema. Those schemata which 
identify self data cover the whole self space, whereas the schemata characterising 
non-self data cover only part of the non-self space because the non-self data is not 
complete. Both self schemata and non-self schemata provide much information for 
the construction of detection rules. 




Self schema: 
a=a1 ∧ b=b1 ∧ c=c2 

Non-self schemata: 
a=a1 ∧ b=b2 
a=a1 ∧ b=b1 ∧ c=c1 



Fig. 2. Some schemata in a decision tree 
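
Reading schemata off a decision tree amounts to enumerating its root-to-leaf paths. The sketch below does this for a small hand-written tree whose shape is inferred from the three schemata listed for Fig. 2; the nested-tuple encoding and the function name are assumptions, and a real system would take the tree from a C4.5 implementation instead.

```python
def extract_schemata(tree, path=()):
    """Return (schema, label) pairs, one per root-to-leaf path.

    A leaf is a class label (str); an internal node is a pair
    (feature, {value: subtree}).
    """
    if isinstance(tree, str):
        return [(dict(path), tree)]
    feature, branches = tree
    schemata = []
    for value, subtree in branches.items():
        schemata.extend(extract_schemata(subtree, path + ((feature, value),)))
    return schemata


# Illustrative tree consistent with the schemata of Fig. 2.
tree = ("a", {
    "a1": ("b", {
        "b1": ("c", {"c1": "non-self", "c2": "self"}),
        "b2": "non-self",
    }),
})

for schema, label in extract_schemata(tree):
    print(label, schema)
```

The shorter schema (a=a1 ∧ b=b2) stops higher in the tree and therefore covers a larger part of the space than the two three-condition schemata.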



3.2 The Extended Negative Selection Algorithm 

In our approach, detection rules are generated in two ways. Some detection rules are 
obtained by learning from examples and the others are semi-randomly generated. 
Thus we have: 

Detection rules = learnt from historical sample + semi-randomly generated 

The algorithm diagram for producing the detector set is shown in Figure 3. The 
semi-randomly generated detection rules, which are explained further below, 
are matched against the common schemata learnt from the historical sample. If a 
candidate rule does not match any common schema, it is accepted as a detection rule and stored 
in the detection rule set; otherwise it is rejected. This process is repeated until an 
appropriate number of detection rules are obtained. 




Fig. 3. Diagram of generating detection rules 



The two kinds of detection rules perform two different detection tasks: 

• The detection rules that are directly obtained from the set of non-self rules 
detect non-self data whose schemata have occurred in the historical data. 

• The detection rules that are semi-randomly generated and do not match 
any common schema detect non-self data whose schemata have 
never been encountered. 

In the monitoring phase, each new data item is matched against every detection rule in the 
detector set, and an anomaly is detected if a match occurs (see Figure 4). 




Fig. 4. Monitoring phase 

Now we discuss the strategy of semi-randomly generating new detection rules. In the 
original negative selection algorithm, a detection rule is randomly generated and then 
matched against self data to test its validity. The drawback of this method is that it 
does not use any information that the self data and non-self data provide and thus 
incurs an exponential computational complexity. We call our method semi-random 
generation of detection rules, which means that a new detection rule is randomly 
generated on the basis of the information that the common schemata provide. 

After a decision tree is built, the features near the root of the decision tree are 
more frequently included in common schemata than those features near the leaves, which 
implies that these features near the root are more important and thus have more 
opportunity to be selected into the detection rules. This heuristic is very valuable, 
especially when the number of features is large. In a huge problem space, it is quite 
difficult to generate a detection rule located in the right place. As we know, C4.5 
uses information gain (in the form of the gain ratio) as the criterion when selecting a 
feature to split a data set into subgroups. 
The feature selected as the root appears in every common schema and thus has a 100% 
opportunity to be selected into a new detection rule; other attributes have a lower 
opportunity to be selected. 

The possibility for a feature $x_i$ to be selected into a new detection rule is calculated 
as follows: 

$$poss(x_i) = \frac{\text{number of times feature } x_i \text{ appears in the schemata}}{\text{number of times the root feature appears in the schemata}}$$ 


Once a feature is selected into a new detection rule, the next step 
is to choose an appropriate interval for it. As shown in Figure 5, the light-gray boxes 
represent intervals that have occurred in the schemata learnt from the historical data 
set, the white ones denote idle intervals, and the dark-gray boxes refer to the 
intervals that are selected into detection rules. Our strategy is that a feature $x_i$ with a 
high $poss(x_i)$ is more likely to be selected with its previously occurred intervals and a 
feature with a low $poss(x_i)$ is more likely to be selected with its idle intervals. 



Fig. 5. Semi-random generation of detection rules (rows: Feature 1 to Feature n, ordered by poss; columns: candidate intervals) 
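
The selection strategy can be sketched as below. The text only says that occurred intervals are "more likely" for high-poss features and idle intervals for low-poss features; this example makes that concrete by choosing an occurred interval with probability poss(x), which is one possible reading and not necessarily the authors' rule. Schemata are simplified to `{feature: interval index}` dictionaries, and all names and the toy data are hypothetical.

```python
import random
from collections import Counter


def feature_possibility(schemata, root):
    """poss(x) = (# schemata containing x) / (# schemata containing the root)."""
    counts = Counter(f for schema in schemata for f in schema)
    return {f: c / counts[root] for f, c in counts.items()}


def generate_candidate(poss, occurred, n_intervals, rng):
    """Draw one candidate rule as {feature: interval index}."""
    rule = {}
    for feature, p in poss.items():
        if rng.random() >= p:  # feature enters the rule with probability p
            continue
        used = sorted(occurred.get(feature, ()))
        idle = [k for k in range(n_intervals) if k not in used]
        # Assumption: prefer occurred intervals with probability p.
        take_used = used and (not idle or rng.random() < p)
        rule[feature] = rng.choice(used if take_used else idle)
    return rule


# Toy schemata over three features, each discretized into 5 intervals.
schemata = [{"PL": 4}, {"PL": 0}, {"PL": 3, "PW": 3}, {"PL": 2, "SL": 0}]
poss = feature_possibility(schemata, root="PL")
occurred = {"PL": {0, 2, 3, 4}, "PW": {3}, "SL": {0}}
print(poss)
print(generate_candidate(poss, occurred, 5, random.Random(0)))
```

A candidate drawn this way would still have to be matched against the common schemata and rejected if it matches one, as described for the generation diagram.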

4 Experiment and Result Analysis 



Experiment 1. This experiment uses the published data set iris, which has 4 
numeric features and 150 examples of three classes. The 100 examples of class 1 and 
class 3 are used as self data, and the 50 examples of class 2 are viewed as non-self 
data. 30 examples of class 2, 10 examples of class 1 and 10 examples of class 3 are 
held out from the sample data for the detection test, so that the training data set 
for the classification algorithm consists of 100 examples. Each numeric feature is 
discretized into 5 equal-length intervals. The schemata and detection rules learnt 
from the examples are shown in Table 1. 
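
Equal-length discretization can be sketched as follows. The example assumes a feature ranging from 4.3 to 7.9, which is the well-known range of sepal length in iris; the function names are illustrative and the handling of values outside the range (clamped to the first or last interval) is an assumption.

```python
def equal_width_edges(lo: float, hi: float, k: int):
    """Boundaries of k equal-length intervals covering [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def interval_index(x: float, edges) -> int:
    """Index of the interval containing x (values outside are clamped)."""
    last = len(edges) - 2
    for i in range(last + 1):
        if x < edges[i + 1]:
            return i
    return last


edges = equal_width_edges(4.3, 7.9, 5)
print([round(e, 2) for e in edges])
print(interval_index(5.1, edges))
```

The rounded boundaries 4.30, 5.02, 5.74, 6.46 and 7.18 coincide with the SL interval endpoints that appear in Table 1.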



Table 1. Schemata and detection rules learnt from training examples 

No  Schemata covering self space 
1   PL ∈ [5.72, 6.90] 
2   PL ∈ [1.10, 2.26] 
3   PL ∈ [4.58, 5.72] and PW ∈ [1.54, 2.02] 
4   PL ∈ [4.58, 5.72] and PW ∈ [2.02, 2.50] 
5   PL ∈ [3.42, 4.58] and SL ∈ [4.30, 5.02] 
6   PL ∈ [4.58, 5.72] and PW ∈ [1.06, 1.54] and SW ∈ [2.00, 2.48] 
7   PL ∈ [4.58, 5.72] and PW ∈ [1.06, 1.54] and SW ∈ [2.48, 2.96] and SL ∈ [5.74, 6.46] 

No  Detection rules 
1   If PL ∈ [2.26, 3.42] then non-self 
2   If PL ∈ [3.42, 4.58] and (SL ∈ [5.02, 5.74] or SL ∈ [5.74, 6.46] or SL ∈ [6.46, 7.18]) 
    then non-self 
3   If PL ∈ [4.58, 5.72] and PW ∈ [1.06, 1.54] and SW ∈ [2.00, 2.48] then non-self 
4   If PL ∈ [4.58, 5.72] and PW ∈ [1.06, 1.54] and SW ∈ [2.48, 2.96] and SL ∈ [6.46, 7.18] 
    then non-self 


Four detection rules are produced by the classifier and marked by light-gray 
squares in Figure 6. Another 5 detection rules (dark-gray squares) are semi-randomly 
generated. Common schemata covering self data are marked by transparent squares. 
All the detection rules are produced around the self data and they separate the data of 
class 1 from the data of class 3. In the testing phase, all 30 examples of class 2 are 
detected by the nine detectors, and one example of class 3 is misclassified as non-self 
data. 




Fig. 6. The distribution of detection rules in two-dimensional space
Experiment 2. To compare with other methods, we carry out this second experiment
with the network intrusion data set published in KDD Cup 1999. It contains a wide
variety of intrusions simulated in a military network environment. The data represent
both normal and abnormal information with 42 attributes and approximately
4,900,000 instances, 10% of which are for training and contain only 10 attack types.
Another 14 attack types are contained in the test data. This makes the detection task
more realistic. The 10% data set is labelled and thus suitable for classification
learning. In our experiments, the 10% data set is still large and is split into two parts
for training and testing, respectively. The first part contains 60% normal records and
the records of 7 attack types, while the second part for testing contains 40% normal
data and the records of 10 attack types.

97 classification rules are learnt from the examples, of which 36 rules identify the
self data, and the remaining 61 rules characterise the non-self space. The shortest rule
consists of 6 attribute-value pairs and the longest rule contains 11 attribute-value
pairs. A further 80 detection rules are semi-randomly generated. The main
experimental results are shown in Table 2.



Table 2. Experimental results

Self common schema              36
Detectors from non-self rules   61
Semi-randomly generated         80
Total detection rules           141
Detection rate                  98.67%
False alarm rate                3.52%



In the testing phase, a record is considered to be an anomaly if it is an attack of any
type. The detection rate is the ratio of the number of abnormal records that are
correctly recognised to the total number of abnormal records, and the false alarm rate
is the ratio of the number of normal records that are incorrectly recognised as
abnormal to the total number of normal records.
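
The two measures defined above can be computed as in the following sketch. The function name and the boolean encoding of the labels (True for abnormal) are assumptions chosen for illustration.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection_rate, false_alarm_rate).

    y_true / y_pred -- sequences of booleans, True meaning 'abnormal'.
    """
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    # Abnormal records correctly recognised as abnormal.
    hits = sum(1 for t, p in pairs if t and p)
    # Normal records incorrectly recognised as abnormal.
    false_alarms = sum(1 for t, p in pairs if not t and p)
    n_abnormal = sum(1 for t, _ in pairs if t)
    n_normal = len(pairs) - n_abnormal
    detection_rate = hits / n_abnormal if n_abnormal else 0.0
    false_alarm_rate = false_alarms / n_normal if n_normal else 0.0
    return detection_rate, false_alarm_rate
```

Both rates are reported per run, so they can be recomputed for each candidate number of detection rules.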

To illustrate the effect of the number of detection rules on the detection rate and
the false alarm rate, we set a different total number of detection rules for each run.
The best number of detection rules is thus determined as 141 (see Figure 7).




[Figure 7: two plots, with x-axes "number of detectors" and "number of detection rules"]

Fig. 7. Detection rate and false alarm rate change with the number of detectors
As mentioned above, the KDD Cup 1999 data set has been used in anomaly detection
experiments by other researchers. A brief comparison is given in
[14] and is reproduced in Table 3, from which we can see that the EFR
(Evolving Fuzzy Rules) algorithm is the best approach, particularly because it uses
the fewest detector rules. We call our approach CSDR, standing for Common Schema-based
Detector Rules. It also exhibits good performance, with a high detection rate (DR), a
low false alarm rate (FA) and a medium-sized set of detector rules. The most
striking characteristic of our approach is that its detector rules are much
shorter than those in other approaches. The reason for this is that the common
schemata in our approach provide much information for the construction of detection
rules. We also draw readers’ attention to the fact that the other approaches,
which employ genetic algorithms or evolutionary algorithms to generate the detection
rules, do not make use of any prior knowledge.



Table 3. Comparison with other approaches

Algorithm    DR%     FA%    # Detectors
EFR          98.30   2.0    15
PHC          93.09   2.0    47.4
ERD          60.90   2.0    331.35
EFRID        98.95   7.0    -
RIPPER-AA    94.26   2.02   -
CSDR         98.67   3.52   141



5 Conclusions 

In this paper, we propose an extended negative selection algorithm for anomaly 
detection. It learns prior knowledge about both the normal and abnormal features 
from historic sample data. The prior knowledge is viewed as common schemata 
characterising the two subspaces: self and non-self, and used to guide the generation 
of detection rules. We use two published data sets in our experiments to test the
effectiveness of our approach and to compare it with other approaches. We
conclude that:

• The prior knowledge learnt from examples describes what kinds of schemata
exist in the two subspaces and thus provides valuable information for the
construction of detection rules that are semi-randomly generated.

• The proposed approach is effective and efficient. Among the 6
approaches compared, our extended negative selection approach achieved the
second highest detection rate. The false alarm rate is very competitive.

• The approach does not rely on a structured representation of the data and
can be applied to general anomaly detection problems.






X. Hang and H. Dai 



References 

[1] Steven A. Hofmeyr and S. Forrest, “Architecture for an artificial immune system”,
Evolutionary Computation, 8(4) (2000) 443-473.

[2] D. Dasgupta and S. Forrest, “Novelty detection in time series data using ideas from
immunology”, in Proceedings of the International Conference on Intelligent Systems,
pp. 82-87, June 1996.

[3] Dipankar Dasgupta and Fabio Gonzalez, “An Immunity-based Technique to
Characterize Intrusions in Computer Networks”, IEEE Transactions on Evolutionary
Computation, 6(3), pages 1081-1088, June 2002.

[4] Paul K. Harmer, Paul D. Williams, Gregg H. Gunsch and Gary B. Lamont, “An
Artificial Immune System Architecture for Computer Security Application”, IEEE
Transactions on Evolutionary Computation, vol. 6, no. 3, June 2002.

[5] Dipankar Dasgupta and Stephanie Forrest, “Artificial immune system in industrial
application”, in Proceedings of the International Conference on Intelligent
Processing and Manufacturing Material (IPMM), Honolulu, HI, July 10-14,
1999.

[6] Dipankar Dasgupta and Stephanie Forrest, “Novelty Detection in Time Series data
using ideas from Immunology”, in Proceedings of the 5th International
Conference on Intelligent Systems, Reno, June 19-21, 1996.

[7] Jamie Twycross and Steve Cayzer, “An Immune-based approach to document
classification”, http://citeseer.nj.nec.com/558965.html.

[8] S. Forrest, A. Perelson, L. Allen, and R. Cherukuri, “Self-nonself discrimination in a
computer”, in Proceedings of the IEEE Symposium on Research in Security and
Privacy, 1994.

[9] Fabio A. Gonzalez and Dipankar Dasgupta, “An Immunogenetic Technique to
detect anomalies in network traffic”, in Proceedings of GECCO 2002: 1081-1088.

[10] Jonatan Gomez, Fabio Gonzalez and Dipankar Dasgupta, “An Immune-Fuzzy
Approach to Anomaly detection”, in Proceedings of the IEEE International
Conference on Fuzzy Systems, St. Louis, MO, May 2003.

[11] Fabio Gonzalez, Dipankar Dasgupta and Luis Fernando Nino, “A Randomized
Real-Value Negative Selection Algorithm”, ICARIS-2003.

[12] M. Ayara, J. Timmis, R. de Lemos, L. de Castro and R. Duncan, “Negative Selection:
How to Generate Detectors”, 1st ICARIS, 2002.

[13] D. Dasgupta, Z. Ji and F. Gonzalez, “Artificial Immune System Research in the last
five years”, in Proceedings of the International Conference on Evolutionary
Computation (CEC), Canberra, Australia, December 8-12, 2003.

[14] De Castro, L. N. & Von Zuben, F. J., “Learning and Optimization Using the Clonal
Selection Principle”, IEEE Transactions on Evolutionary Computation, Special Issue
on Artificial Immune Systems, 6(3), pp. 239-251, 2002.

[15] Jungwon Kim and Peter Bentley, “Negative selection and Niching by an artificial
immune system for network intrusion detection”, in Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO '99), Orlando, Florida, July 13-17.



Adaptive Clustering for Network Intrusion Detection 



Joshua Oldmeadow¹, Siddarth Ravinutala¹, and Christopher Leckie²

ARC Special Research Centre for Ultra-Broadband Information Networks
¹ Department of Electrical and Electronic Engineering
² Department of Computer Science and Software Engineering
The University of Melbourne
Parkville, Victoria 3010 Australia
http://www.cs.mu.oz.au/~caleckie



Abstract. A major challenge in network intrusion detection is how to perform 
anomaly detection. In practice, the characteristics of network traffic are typi- 
cally non-stationary, and can vary over time. In this paper, we present a solu- 
tion to this problem by developing a time-varying modification of a standard 
clustering technique, which means we can automatically accommodate non- 
stationary traffic distributions. In addition, we demonstrate how feature 
weighting can improve the classification accuracy of our anomaly detection 
system for certain types of attacks. 



1 Introduction 

Network intrusion detection is the problem of identifying misuse of computer net- 
works [1]. A major challenge for any Intrusion Detection System (IDS) is how to 
identify new attacks when the normal background traffic is itself changing with time. 
In this paper, we present a novel approach to detecting network intrusions, based on 
an adaptive clustering technique. In comparison to earlier approaches that assume a 
static model of network traffic (e.g. [2, 3, 4]), we demonstrate that this approach has 
significant advantages in terms of accuracy under changing traffic conditions. 

An important approach for intrusion detection research is anomaly detection using 
unsupervised learning techniques, which do not require labeled data and are able to 
detect previously “unseen” attacks. A critical issue for the success of unsupervised 
anomaly detection is how to cope with changing traffic conditions [5]. Our main 
contribution to this problem has been to develop and evaluate a time-varying approach
to a fixed-width clustering algorithm. A secondary contribution of this paper
is the investigation of the impact of feature weighting on the performance of an IDS.
The performance of these two enhancements on a standard dataset is then compared 
to three other benchmark algorithms presented in [2]. 
2 Anomaly Detection for Network Security

The process of anomaly detection comprises two phases: training and testing. The 
training phase involves characterizing a set of given network connections. Each connection
$c$ is represented by a set of $d$ features, i.e., each connection $c$ is represented as
a point in the feature space $\Re^d$. Note that this training data may contain both normal
and attack data. In order to build a model that discriminates between normal and 
attack data points, we need to assume that attack data occurs far less frequently than 
normal traffic. As a guide we need to assume that less than x % of the data consists of 
attack connections. The second phase of anomaly detection, the test phase, analyses 
new network connections based upon the information gathered in the training phase. 
New connections are labeled as anomalous or normal based upon the model devel- 
oped in the training phase. 

The problem of anomaly detection for network intrusion has been an active area of 
research. Our approach is based on the model of fixed-width clustering in [2], which 
builds a set of clusters that each has a fixed radius in the feature space. Each cluster is 
represented by a centroid, which is the mean of all the points in the cluster. In the 
training phase of the fixed-width clustering technique, a threshold w is chosen as the 
maximum radius of a cluster. At the end of training, the clusters that contain less than
a threshold $\tau$ % of the total set of points are labeled as anomalous. All other clusters
are labeled as normal. The test phase operates by calculating the distance between a 
new point c and each cluster centroid. If the distance from c to the nearest cluster is 
greater than w, then c lies in a sparse region, and is labeled as anomalous. 

The data set used to test these algorithms was from the 1999 KDD Cup Data Min- 
ing competition [7]. The 1999 KDD Cup was a supervised learning competition, and 
contained labeled training and test datasets. The training data contains 24 attack 
types. The test data contains 14 additional attacks. However, in order to use this da- 
taset for unsupervised learning, we needed to remove two easily detected DOS band- 
width attacks - smurf and neptune. These two attacks constitute just over 79% of the 
training data set, and are very easy to detect by other means. We also removed the 
snmpgetattack, which is almost identical to a normal connection. The new attacks in 
the test set are used to investigate how the system performs when attacks are included 
that the IDS has not previously seen. 



3 Extending Fixed-Width Clustering 

Our anomaly detection technique is based on the method of fixed-width clustering. In 
this section, we first describe the implementation details of how we used the fixed- 
width clustering algorithm described in [3]. We then describe how we adapted this 
algorithm for time-varying clustering, in order to improve its detection accuracy on 
the KDD Cup dataset. 
3.1 Implementation of Fixed-Width Clustering

The algorithm for fixed-width clustering is based on the outline in [3]. We are given a
set of network connections for training, where each connection $c_i$ in this set is
represented by a $d$-dimensional vector of features. We then proceed through the following
stages of (1) normalization, (2) cluster formation, and (3) cluster labeling.

Normalization - To ensure that all features have the same influence when calculating
the distance between connections, we normalised each continuous feature $x_j$ in
terms of the number of standard deviations from the mean of the feature.

Cluster Formation - After normalization, we measure the distance of each connection
$c_i$ in the training set to the centroid of each cluster that has been generated
so far in the cluster set. If the distance to the closest cluster is less than the threshold 
w, then the centroid of the closest cluster is updated, and the total number of points in 
the cluster is incremented. Otherwise, a new cluster is formed. 

Cluster Labelling - The assumption that the ratio of attack to normal traffic is extremely
small is used as a classification criterion in this algorithm. If a cluster contains
more than the classification threshold fraction $\tau$ of the total points in the data set,
it is labeled as a normal cluster; otherwise it is labeled as anomalous. We found the value
$\tau$ = 0.02 was most effective in our evaluation.

Test Phase - In the testing or real-time phase, each new connection is compared to
each cluster to determine whether it is anomalous. The distance from the connection
to each of the clusters is calculated. If the distance to the nearest cluster is less
than the cluster width parameter w, then the connection shares the label (normal or 
anomalous) of its nearest cluster. Otherwise, the connection is labeled as anomalous. 
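
The four stages described above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the dictionary layout of a cluster, the Euclidean distance, and the use of the population standard deviation are all assumptions.

```python
import math

def zscore_params(rows):
    """Per-feature mean and standard deviation of the training rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return means, stds

def normalise(row, means, stds):
    """Express each feature as a number of standard deviations from its mean."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(rows, w, tau):
    """Single pass of fixed-width clustering followed by cluster labelling."""
    clusters = []  # each cluster: {"centroid": [...], "count": int, "label": str}
    for p in rows:
        best = min(clusters, key=lambda c: dist(p, c["centroid"]), default=None)
        if best is not None and dist(p, best["centroid"]) < w:
            n = best["count"]
            # Incremental update of the centroid (mean of the cluster's points).
            best["centroid"] = [(m * n + x) / (n + 1)
                                for m, x in zip(best["centroid"], p)]
            best["count"] = n + 1
        else:
            clusters.append({"centroid": list(p), "count": 1})
    total = len(rows)
    for c in clusters:
        c["label"] = "normal" if c["count"] / total > tau else "anomalous"
    return clusters

def classify(p, clusters, w):
    """Label a new point by its nearest cluster, or as anomalous if too far."""
    best = min(clusters, key=lambda c: dist(p, c["centroid"]))
    return best["label"] if dist(p, best["centroid"]) < w else "anomalous"
```

Because clusters are formed in a single pass, the result depends on the order in which the training connections are presented.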

3.2 Our Contributions 

We have investigated two techniques for improving fixed-width clustering for
network intrusion detection: feature weighting and time-varying clusters.

Feature Weighting - We found that some key features have more of an impact in 
differentiating certain types of attacks from normal traffic. Consequently, we wanted 
to investigate whether feature weighting could improve the performance of fixed- 
width clustering on the test data. Based on a manual inspection of the classification 
accuracy on the training set, we found that the accuracy could be improved by giving 
greater weight to the feature that encodes the average packet size of a connection. 

Time-varying Clusters - Traffic in a network is never stagnant. For example, new 
services can be activated, work patterns can change as projects start or finish, and so 
on. Consequently, an IDS needs to be able to adapt to these changing traffic patterns 
while still maintaining a high level of detection accuracy. 

Since the test-data is of a different distribution to the training dataset, we expect a 
significant improvement in performance when the clusters are allowed to move dur- 
ing real-time operation. In our modified algorithm for time-varying clusters, we allow 
the clusters to be updated during the on-line testing phase. However, if the formula 
used in the training phase were to be used in the testing phase, connections added to 
large clusters would have a lesser influence on the cluster when compared with connections
added to smaller clusters. Hence an influence factor $\gamma$ is needed to regulate
the effect of new points on a cluster. When assigning a new connection $c_i$ to a cluster
$\phi_n$ during the on-line test phase, the mean of feature $j$ of the cluster is updated as
follows:

$$\mathrm{mean}_j(\phi_n) = \frac{\mathrm{mean}_j(\phi_n) \times \gamma + \mathrm{feature}_j(c_i)}{\gamma + 1}$$
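
A minimal sketch of this on-line update is given below, assuming the update rule has the form mean ← (mean × γ + feature) / (γ + 1), where γ denotes the influence factor. The function names are hypothetical.

```python
def update_cluster_mean(mean, feature_value, gamma):
    """Influence-factor update of one feature mean during on-line testing.

    gamma acts as a fixed pseudo-count: the larger gamma is, the less
    each new connection moves the cluster.
    """
    return (mean * gamma + feature_value) / (gamma + 1)

def update_centroid(centroid, connection, gamma):
    """Apply the same update to every feature of the cluster centroid."""
    return [update_cluster_mean(m, x, gamma) for m, x in zip(centroid, connection)]
```

Unlike the training-phase update, the weight of a new connection here does not shrink as the cluster grows, so large and small clusters adapt at the same rate.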

Table 1 shows how the modified algorithm, with time-varying clusters, performs 
against the four types of attacks. There is a significant improvement in comparison to 
the results obtained when only the feature-weighting enhancement is implemented. The
table shows the percentage improvement resulting from using time-varying clusters 
by comparing the area under the ROC graphs. 



Table 1. Performance of modified fixed-width clustering

Attack   Area (original            Area          Area (weighting     Improvement over
Types    fixed-width clustering)   (weighting)   and time-varying)   weighted (%)
DOS      0.546                     0.982         0.979               -0.31%
Probe    0.962                     0.895         0.952               5.98%
U2R      0.961                     0.897         0.967               7.24%
R2L      0.979                     0.940         0.960               2.08%



As expected, the performance of our algorithm improved against almost all attacks 
with the implementation of time-varying clusters and feature weighting. In order to 
evaluate our approach, we have compared our results to those quoted in [2] for
k-nearest neighbour, fixed-width clustering, and support vector machines [6]. They
evaluated all these techniques on the 1999 KDD Cup dataset [7]. 

Allowing the clusters to move in real-time and implementing feature weighting 
allowed us to outperform all the algorithms presented in [2]. Using the area under the
ROC graph as a measure, Table 2 lists the percentage improvement in performance that
our modified clustering algorithm achieved over the results reported in [2].



Table 2. Relative comparison of our time-varying clustering with results from [2]

Algorithm                     % Improvement using Modified Time-Varying Clustering
K-NN                          8.0%
Fixed-width Clustering        3.4%
SVM                           2.5%
Our modified TV-Clustering    -



As mentioned previously, not only did our test data set contain all the attacks present
in the training dataset, but it also included attacks which were not present in the
training set. This demonstrates the advantages of our modified time-varying clustering
algorithm in terms of its ability to detect both new and known attacks.



4 Conclusion 

In this paper, we have presented two enhancements to using clustering for anomaly 
detection. First, we have demonstrated the benefits of feature weighting to improve 
detection accuracy. Although our feature weighting was based on a manual analysis 
of the detection accuracy on the training data, our results provide a clear motivation 
for further research into automated approaches for feature weighting in this context. 
Second, we developed a time-varying clustering algorithm, which can adapt to 
changes in normal traffic conditions. We demonstrated that our time-varying ap- 
proach was able to achieve significant improvement in detection and false positive 
rates compared to earlier approaches. 
Abstract. This paper presents an ensemble MML approach for the discovery
of causal models. The component learners are formed based on
the MML causal induction methods. Six different ensemble causal induction
algorithms are proposed. Our experimental results reveal that (1) the
ensemble MML causal induction approach achieves an improved result
compared with any single learner in terms of learning accuracy and
correctness; (2) among all the ensemble causal induction algorithms examined,
the weighted voting without seeding algorithm outperforms all
the rest; (3) the ensembled CI algorithms appear to alleviate
the local minimum problem. The only drawback of this method is that
the time complexity is increased by a factor of $S$, where $S$ is the ensemble size.



1 Introduction 

Discovering causal relationships from a given data set with m instances and n variables
is one of the major challenges in the areas of machine learning and data
mining. In the last ten years, several projects have addressed it [1,2,3,4] using different
techniques. These techniques can be classified into three major categories.
The first learns Bayesian networks containing compact representations
for the conditional probability distributions (CPDs) stored at each node, such
as the work done by Heckerman [5]. The second category consists of constraint-based
methods. The representative of this category is the TETRAD system
developed by Peter Spirtes, Clark Glymour and Richard Scheines of the CMU
group from the Department of Philosophy, Carnegie Mellon University [6]. The
third category applies informatic (or asymptotically Bayesian)
scoring functions such as MML/MDL measures to evaluate the goodness-of-fit of
the discovered model to the given data. The representative work of this category was
originally done at Monash University, headed by Chris Wallace [2,3], and then
moved to Deakin University, directed by Honghua Dai [1,7,4].

None of the work in these three categories has overcome the local
minimum problem; the methods are usually limited to finding a
local minimum. To further enhance the accuracy and to try to overcome the
local minimum problem in discovering causal models, this paper investigates an
ensembling approach that aims to achieve a better result in causal induction.
2 Causal Discovery

Our proposed method integrates the informatic-MML causal induction method 
developed so far and the ensemble learning approaches. 

A causal model, normally in a form of a directed acyclic graph (DAG), is 
a representation of known or inferred causal relations among a set of variables. 
Different flavors of causal network can be defined, depending on the nature of the 
variables concerned (e.g., real or discrete), the form assumed for the effect of one 
variable on another (e.g., linear or non-linear) and the model assumed for the 
variation of variable values which is not accounted for by the variable’s imputed 
causes. Here we are concerned with simple linear effects among real variables, 
and we assume Gaussian unexplained variation. Specifically, if we have a set of 
$T$ real variables, and $N$ independent instances of data, i.e., if we have the data
$\{x_{n,t} : n = 1 \ldots N, t = 1 \ldots T\}$, a causal model of the data specifies, for each
variable $v_t$, a possibly-empty “parent set” of variables $\{v_{p_{t,m}} : m = 1, \ldots, d_t\}$,
where $d_t$ is the number of “parents” of $v_t$, and a probabilistic relation among
data values

$$x_{n,t} = \sum_{m=1}^{d_t} a_{t,m}\, x_{n,p_{t,m}} + r_{n,t} \qquad (1)$$

where the coefficients $\{a_{t,m} : m = 1 \ldots d_t\}$ give the linear effects of the parents
on $v_t$, and the residual $r_{n,t}$ is assumed to be a random variate from the Gaussian
density $N(0, \sigma_t^2)$. Note that we have assumed for simplicity that the mean of each
variable is zero.
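
To make the linear model concrete, the sketch below draws data from such a model. It is an illustration under stated assumptions, not part of the induction method: the function name and argument layout are hypothetical, and the variables are assumed to be listed in an order in which every parent precedes its children.

```python
import random

def sample_linear_model(parents, coeffs, sigmas, n, rng=random):
    """Draw n instances from a linear Gaussian causal model.

    parents[t] -- list of parent indices of variable t (each must precede t)
    coeffs[t]  -- list of path coefficients, aligned with parents[t]
    sigmas[t]  -- standard deviation of the residual of variable t
    """
    data = []
    for _ in range(n):
        x = []
        for t in range(len(parents)):
            # Linear effect of the parents plus a zero-mean Gaussian residual.
            value = sum(a * x[p] for a, p in zip(coeffs[t], parents[t]))
            value += rng.gauss(0.0, sigmas[t])
            x.append(value)
        data.append(x)
    return data
```

For example, `sample_linear_model([[], [0]], [[], [2.0]], [1.0, 0.5], 100)` generates 100 instances of a two-variable model in which the first variable is a parent of the second with path coefficient 2.0.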

The model is restricted to ones in which the inter-variable relations are open
to a physically-tenable causal interpretation. Thus, if we define the “ancestors”
of a variable as its parents, its parents’ parents, and so on, we require that
no variable has itself as an ancestor. With this restriction, the topology of the
causal models among the variables may be represented by a directed acyclic
graph (DAG) in which each node or vertex represents a variable, and there is a
directed arc from $v_t$ to $v_j$ iff $v_t$ is a parent of $v_j$.

Our task is: for a given sample data set with $N$ instances and $T$ variables, to
induce the graph-structured knowledge with the highest posterior probability
given such a data set.

This discovery involves two major processes, encoding and searching, as described
in [2,7]. The encoding of a causal model includes describing the causal
model and describing the data as in [2]:

$$L = L(\mathrm{Model}) + L(\mathrm{Data} \mid \mathrm{Model}) \qquad (2)$$

The search strategy used in both [2] and [7] is the greedy search method.



3 Ensemble Learning

Ensemble learning is a learning paradigm where multiple component learners
are trained for the same task, and the outcomes of these component learners are
combined for dealing with future instances.
Since an ensemble is often more accurate than its component learners [8,9,
10], such a paradigm has become a hot topic of supervised learning and has
already been successfully applied to optical character recognition [11,12], face
recognition [13,14], scientific image analysis [15,16], medical diagnosis [17,18], seismic
signals classification [19], etc. Building a good ensemble involves increasing
both the accuracy of and the diversity among the component learners [20], which 
is not an easy task. Nevertheless, many practical ensemble methods have been 
developed. 

In general, an ensemble is built in two steps, that is, obtaining multiple com- 
ponent learners and then combining what they learnt. As for obtaining compo- 
nent learners, the most prevailing methods are Bagging[8] and Boosting[21]. Bag- 
ging is introduced by Breiman[8] based on bootstrap sampling [22]. It generates 
several training sets from the original training set and then trains a component 
learner from each of those training sets. Boosting is proposed by Schapire[21] 
and improved by Freund et al [23,24]. It generates a series of component learn- 
ers where the training sets of the successors are determined by the performance 
of the predecessors in the way that training instances that are wrongly pre- 
dicted by the predecessors will play more important roles in the training of the 
successors. There are also many other methods for obtaining the component 
learners. Some examples are as follows. Breiman’s Arc-x4 generates component 
learners by considering the performance of all the previous learners. Bauer and 
Kohavi’s Wagging [25] uses a Poisson distribution to sample training sets for component
learners from the original training set. Webb’s MultiBoost [26] embeds
Boosting into Wagging to create component learners. Zhou et al.’s GASEN [18]
employs a genetic algorithm to select component learners from a set of trained
learners. As for combining the predictions of component learners, the methods 
used for classification and regression are quite different because those two kinds 
of tasks have distinct characteristics, i.e. the outputs of classification are class 
labels while those of regression are numeric values. At present, the most prevail- 
ing methods for classification tasks are plurality voting or majority voting[27], 
and those for regression tasks are simple averaging [28] or weighted averaging 
[29]. There are also many other methods, such as employing learning systems 
to combine component predictions[10], exploiting principal component regres- 
sion to determine constrained weights for combining component predictions [30], 
using dynamic weights determined by the confidence of the component learners 
to combine their predictions [31], etc. To the best of the authors’ knowledge,
research on ensemble learning has mainly focused on supervised learning, and no
work has addressed the issue of building an ensemble of causal inducers, although
this may not only generate strong causal inducers but also extend the usability
of ensemble learning methods.

4 Ensemble MML Causal Induction 

The overall structure of our ensemble MML causal induction process is shown 
in Figure 1. For a given original data set, we generate S data sets with the same 
sample size as the original data set. For each generated data set $D_i$, the component
learner induces a causal model $M_i$. Then these $S$ models are ensembled
with some ensemble learning strategy to come up with a final causal model.
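
The data generation step can be sketched as follows, assuming the $S$ data sets are drawn by sampling instances with replacement (bootstrap sampling). The function names are hypothetical.

```python
import random

def bootstrap(data, rng=random):
    """Sample len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def generate_datasets(data, S, rng=random):
    """Generate S bootstrap data sets, each as large as the original."""
    return [bootstrap(data, rng) for _ in range(S)]
```

Each generated data set would then be passed to the component learner to induce one causal model.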




Fig. 1. Processing diagram of Bagging 



4.1 The Building Block 

The component learners are the building blocks used in an ensemble learning
method. In this research, we use the improved MML causal induction method
proposed in [7] as the component learner.

4.2 Ensemble Algorithms 

In total, we proposed six ensemble causal induction algorithms. The differences 
among them lie in two choices:

Seeding Graph. The component discovery algorithms in BSV and BWV start
with a null graph, while the component discovery algorithms in BMSV and
BMWV start with a seeding graph generated by a Markov Chain Monte Carlo
algorithm.

Voting Strategy. Simple voting is used in both BSV and BMSV, where every
component has the same voting weight. However, in BWV and BMWV,
weighted voting is used, where the smaller the message length, the higher the
voting weight.

BSV Causal Inducer. BSV stands for Bagging with simple voting, which
is a variant of Bagging [8]. It employs bootstrap sampling [22] to generate several
training sets from the original data, and induces a linear causal model from each
of the generated data sets with the MML causal inducer [7], which uses the improved
encoding scheme and the greedy search method. The final structure is obtained
by majority voting over the individual model structures learnt by the individual
learners. The path coefficients of the final causal model are obtained using an MML-based
optimization approach. In this method, no seed model is provided.
Algorithm 1 BSV-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.
Seed: Empty.
for s = 1 to S do
    D_s = bootstrap(D)
    M_s = I(D_s)
end for
M.stru = majority_voting(M_s.stru)
M.param = MML_Optim(D, M.stru)



Algorithm 2 BWV-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.
Seed: Empty.
for s = 1 to S do
    D_s = bootstrap(D)
    M_s = I(D_s)
    W_s = MML(M_s, D_s)
end for
M.stru = weighted_voting(M_s.stru)
M.param = MML_Optim(D, M.stru, M_s.param)



The BSV-CI algorithm is shown in Algorithm 1, where $S$ bootstrap samples
$D_1, D_2, \ldots, D_S$ are generated from the original data set $D$, a component
causal model $M_s$ is obtained from each $D_s$, and an ensembled causal model $M$ is built from
$M_1, M_2, \ldots, M_S$. The structure of $M$ is determined through majority voting,
where a connection exists if and only if it exists in a majority of the
component causal models. The path coefficients of $M$ are determined by the MML-based
optimization approach [2].
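
The majority voting step over structures can be sketched as below. The representation of a structure as a set of directed arcs `(parent, child)` and the function name are assumptions; the sketch also does not check that the voted structure is acyclic, which a full implementation would need to address.

```python
from collections import Counter

def majority_vote(structures):
    """Keep an arc iff it appears in more than half of the structures.

    structures -- list of sets of directed arcs (parent, child)
    """
    counts = Counter(arc for s in structures for arc in s)
    threshold = len(structures) / 2
    return {arc for arc, c in counts.items() if c > threshold}
```

Because every component model is acyclic, an arc and its reverse can never both reach a strict majority, although longer cycles are not ruled out by voting alone.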

BWV Causal Inducer. BWV stands for Bagging with weighted voting, as
shown in Algorithm 2. It is almost the same as BSV, but during the voting on the
final structure, individual structures are assigned different weights. Weights are
decided using the message length of the individual causal model. The smaller the
message length is, the higher the weight of the structure.

In BSV-CI, all the component causal models are regarded as equally important,
as only simple majority voting is used. In general, the quality of the component
causal models may be quite different. So it might be better to attach a weight to
each component model, which measures the quality of the model to some extent,
and then perform weighted voting when ensembling. To measure the quality
of the causal models, a possible choice is the MML measurement [2,7].
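A minimal sketch of weighted voting is given below. The text only states that a shorter message length should give a higher weight; the specific mapping used here, a weight proportional to exp(-(L - L_min)), and the 0.5 acceptance threshold are assumptions made for illustration, not the authors' formula.

```python
import math

def mml_weights(message_lengths):
    """Map message lengths to normalized weights; shorter means heavier.

    Assumption: weight is proportional to exp(-(L - L_min)).
    """
    shortest = min(message_lengths)
    raw = [math.exp(-(length - shortest)) for length in message_lengths]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_vote(structures, weights):
    """Keep an edge iff the total weight of the models containing it exceeds 0.5."""
    share = {}
    for edges, w in zip(structures, weights):
        for edge in edges:
            share[edge] = share.get(edge, 0.0) + w
    return {edge for edge, s in share.items() if s > 0.5}
```

Unlike simple majority voting, a single component model with a clearly shorter message length can outvote several poorer models.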

BMSV and BMWV Causal Inducers. These two methods are similar
to BSV-CI and BWV-CI, except that in BSV-CI and BWV-CI the seeding graphs are
empty, whereas in these methods a seed is provided to the causal inducer. The
seeding structures are generated by MCMC.

Note that in both BSV-CI and BWV-CI, all the component causal models
are obtained with a null seed. In fact, starting from different seeds may result in
different causal models. In building ensembles of neural networks, Maclin and
Shavlik [32] obtained good results by initializing different component
neural networks at different points in the weight space. So another possible
ensemble strategy is to obtain many seeds, generate a component causal model
from each seed, and then combine the component models.

Thus we obtain BMSV-CI, which is shown in Algorithm 3. Seeds sampled
from MCMC can also be used with BWV-CI to obtain BMWV-CI, as was done
for BSV-CI. The result is the fourth algorithm, which is shown in Algorithm 4.



Algorithm 3 BMSV-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.

for s = 1 to S do
    Seed_s = MCMC(D)
    M_s = I(D, Seed_s)
end for

M.Stru = majorityvoting(M_s.Stru), s = 1, ..., S
M.weight = weightsearch(M.Stru, D)



Algorithm 4 BMWV-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.

for s = 1 to S do
    Seed_s = MCMC(D)
    M_s = I(D, Seed_s)
    W_s = MML(M_s, D)
end for

M.Stru = weightedvoting(M_s.Stru), s = 1, ..., S
M.weight = weightsearch(M.Stru, D)



BMSV-S and BMWV-S Causal Inducers. Note that in both BMSV-CI
and BMWV-CI, the component causal models are generated directly from
the original data set, starting from a generated seed model. If bootstrap sampling
is introduced so that the component causal models are generated from sampled
data sets with different seeds, two more algorithms, BMSV-S-CI and
BMWV-S-CI, are obtained, as shown in Algorithm 5 and Algorithm 6.

In applying any of the ensembling algorithms, if the ensembled model is
not a DAG, a further repairing strategy is applied.

Algorithm 5 BMSV-S-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.

for s = 1 to S do
    D_s = bootstrap(D)
    Seed_s = MCMC(D_s)
    M_s = I(D_s, Seed_s)
end for

M.Stru = majorityvoting(M_s.Stru), s = 1, ..., S
M.weight = weightsearch(M.Stru, D)



Algorithm 6 BMWV-S-CI Algorithm
Input: dataset D, inducer I, ensemble size S
Output: ensemble M.

for s = 1 to S do
    D_s = bootstrap(D)
    Seed_s = MCMC(D_s)
    M_s = I(D_s, Seed_s)
    W_s = MML(M_s, D)
end for

M.Stru = weightedvoting(M_s.Stru), s = 1, ..., S
M.weight = weightsearch(M.Stru, D)
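Voting edge by edge does not guarantee an acyclic result, which is why the ensembled structure has to be checked before use. The sketch below only detects whether a voted edge set is a DAG, using Kahn's topological-sort algorithm; the repair strategy itself is not described in the text and is not attempted here. The edge-set representation is an assumption of this sketch.

```python
def is_dag(nodes, edges):
    """Return True iff the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Repeatedly remove nodes that have no remaining incoming edges.
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Nodes on a cycle never reach indegree zero, so they are never removed.
    return removed == len(nodes)
```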



5 Empirical Results and Analysis 

In our experiments, we compare six learning algorithms tested on 7 models as 
shown in Table 1(a). For each one of the 7 models, we generate a data set with 
1000 instances. 



Table 1. Experimental Data Sets and Related Results

(a) Information of Data Set

Data Set   Nodes   Sample Size
Fiji       4       1000
Evans      5       1000
Blau       6       1000
Rodgers    7       1000
Case 9     9       1000
Case 10    10      1000
Case 12    12      1000

(b) Performance Comparison of Algorithms

Alg      Fiji      Blau      Rodgers   Evans     Case 12
MML-CI   [2,0,1]   [0,1,2]   [2,1,2]   [2,1,3]   [0,0,0]
MCMC     [2,0,1]   [0,0,1]   [0,0,0]   [3,1,1]   [0,0,1]
BMSV     [2,0,1]   [0,0,1]   [0,0,0]   [3,1,1]   [0,0,0]
BMWV     [2,0,1]   [0,0,1]   [0,0,0]   [3,1,1]   [0,0,0]
BSV      [2,0,1]   [0,0,1]   [2,1,2]   [2,1,3]   [0,0,0]
BWV      [2,0,1]   [0,0,1]   [0,0,0]   [1,0,2]   [0,0,0]

Table 1(b) reports the performance comparison
of the following algorithms: MML-CI, MCMC, BSV, BWV, BMSV and BMWV.

Fig. 2. Comparison of Discovery Result on Evans. Panels: (a) Original, (b) by GS,
(c) by BSV, (d) by BWV, (e) by MCMC, (f) by MWV, (g) by BMSV, (h) by BMWV.

In Table 1(b), a triple [m, a, r] is used to represent the results, in which
m is the number of missing edges, a is the number of added edges, and r is the
number of reversed edges.
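A triple of this kind can be computed by comparing the edge set of the true model with that of the discovered model. The sketch below assumes that an edge found with the wrong orientation is counted once, as reversed, and not additionally as missing or added; the text does not state which convention was used.

```python
def edge_errors(true_edges, learned_edges):
    """Return [missing, added, reversed] counts for directed edge sets."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # A true edge is reversed if only its opposite orientation was learned.
    reversed_edges = {(a, b) for (a, b) in true_edges
                      if (b, a) in learned_edges and (a, b) not in learned_edges}
    # A true edge is missing if it was learned in neither orientation.
    missing = {(a, b) for (a, b) in true_edges
               if (a, b) not in learned_edges and (a, b) not in reversed_edges}
    # A learned edge is added if the true model has it in neither orientation.
    added = {(a, b) for (a, b) in learned_edges
             if (a, b) not in true_edges and (b, a) not in true_edges}
    return [len(missing), len(added), len(reversed_edges)]
```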

From Table 1(b) and Figures 2 and 3, we can see that:

1. All four ensemble algorithms work better than the individual causal
induction algorithms, MML-CI and MCMC.

2. Among all the ensemble CI algorithms, BWV achieved the best result.



Fig. 3. Comparison of Discovery Result on Rodgers. Panels: (a) Original, (b) by GS,
(c) by BSV, (d) by BWV, (e) by MCMC, (f) by MWV, (g) by BMSV, (h) by BMWV.



Our experiments also show that:

1. Interestingly, weighted voting without a seed achieved a better result
than weighted voting with seeding. This can be seen from the results of BWV
and BMWV on the Evans model.

2. Bagging with seeds from MCMC sampling does not improve the performance
of the causal discovery algorithm.

3. For the MCMC algorithm, even though the number of samples needed by MCMC
is theoretically polynomial (not exponential) in the dimensionality of the
search space, in practice it has been found that MCMC does not converge
in reasonable time, especially when a model has more than about 10 nodes.

4. Since an ensemble learning strategy can combine the best of a number of
discovered models, and different seeding models provide different starting
points in the model space, it is likely that this method can overcome the
local minimum problem. The performance results seem to support this, but it
needs to be further verified.

5. In terms of precision of the resulting structure, Bagging with weighted voting
outperforms all the other algorithms examined, including Message Length-based
Greedy Causal Discovery and the MCMC algorithm.

Tables 2 and 3 list the results of the 6 algorithms for ensemble sizes
from 3 to 19. There is no indication that a larger ensemble size achieves a
better result, or vice versa.



Table 2. MML on Rodgers under different ensemble size

Alg     3       7       11      15      19
MCMC    52.42   57.63   52.42   53.23   63.30
MWV     52.42   52.42   52.42   52.42   61.96
BMSV    52.42   52.42   52.42   52.42   61.96
BMWV    52.42   52.42   52.42   52.42   52.42
BSV     62.28   62.28   62.28   62.28   62.28
BWV     52.06   52.06   52.06   52.06   52.06
(The leading 93 of each number is omitted, so 53.2 stands for 9353.2.)



Table 3. Number of Errors on Rodgers under different ensemble size

Alg     3   5   7   9   11  13  15  17  19
MCMC    5   3   4   5   4   0   5   1   7
MWV     3   3   3   3   3   0   3   0   7
BMSV    3   3   3   3   3   0   3   0   7
BMWV    3   3   3   3   3   0   3   0   3
BSV     5   5   5   5   5   5   5   5   5
BWV     0   0   0   0   0   0   0   0   0



6 Conclusions 



In this paper, we proposed an ensemble MML causal induction method and
examined several ensembling strategies. Comparing this new causal induction
algorithm based on Bagging with the previous algorithms [2,7], the following
conclusions appear to be supported by our experimental results:

1. In terms of correctness and accuracy, the Bagging ensembling approach with
weighted voting and without seeding outperforms all the other algorithms
examined.

2. A larger ensemble size does not seem to improve performance.

3. The model induced by the ensembling algorithms is almost always better than
that induced by any single causal learner.

Abstract. In this paper we upgrade linear logistic regression and boosting
to multi-instance data, where each example consists of a labeled bag
of instances. This is done by connecting predictions for individual in- 
stances to a bag-level probability estimate by simple averaging and max- 
imizing the likelihood at the bag level — in other words, by assuming that 
all instances contribute equally and independently to a bag’s label. We 
present empirical results for artificial data generated according to the 
underlying generative model that we assume, and also show that the two 
algorithms produce competitive results on the Musk benchmark datasets. 



1 Introduction 

Multi-instance (MI) learning differs from standard supervised learning in that 
each example is not just a single instance: examples are collections of instances, 
called “bags”, containing an arbitrary number of instances. However, as in stan- 
dard supervised learning there is only one label for each example (i.e. bag). This 
makes MI learning inherently more difficult because it is not clear how the bag 
label relates to the individual instances in a bag. The standard way to approach
the problem is to assume that there is one “key” instance in a bag that triggers
whether the bag’s class label will be positive or negative.¹ This approach
has been proposed for predicting the activity of a molecule, where the instances
correspond to different shapes of the molecule, and the assumption is that the
presence of a particular shape determines whether a molecule is chemically active
or not [1]. The task is then to identify these key instances. A different approach
is to assume that all instances contribute equally and independently to a bag’s
class label. In the above context this corresponds to assuming that the different
shapes of a molecule have a certain probability of being active and the average
of these probabilities determines the probability that the molecule will be active.
In this paper, we use the latter approach to upgrade logistic regression and
boosting to MI problems.

The paper is organized as follows. In Section 2 we explain the underlying
generative model that we assume and illustrate it using artificial MI data based on
this model. In Section 3 we describe how we use the generative model to upgrade
linear logistic regression and boosting to MI learning. In Section 4 we evaluate
the resulting algorithms on both the artificial data and the Musk benchmark
problems [1]. Section 5 summarizes our results.

¹ Throughout this paper we assume a classification task with two possible classes.

2 A Generative Model 

We assume that the class label of a bag is generated in a two-stage process. The 
first stage determines class probabilities for the individual instances in a bag, 
and the second stage combines these probability estimates in some fashion to 
assign a class label to the bag. The difficulty lies in the fact that we cannot 
observe class labels for individual instances and therefore cannot estimate the 
instance-level class probability function directly. 

This general two-stage framework has previously been used in the Diverse
Density (DD) algorithm [2] to learn a classifier for MI problems. In the first
step, DD assumes a radial (or “Gaussian-like”) formulation for the instance-level
class probability function $\Pr(y|x)$ (where $y$ is a class label, either 0 or
1, and $x$ an instance). In the second step, it assumes that the values of $\Pr(y|x)$
for the instances in a bag are combined using either a multi-stage (based on the
noisy-or model) or a one-stage (based on the most-likely-cause model) Bernoulli
process to determine the bag’s class label. In both cases DD essentially attempts
to identify ellipsoid areas of the instance space for which the probability of
observing a positive instance is high, i.e. it attempts to identify areas that have
a high probability of containing “key” instances. The parameters involved are
estimated by maximizing the bag-level likelihood.

In this paper we use the same two-stage framework to upgrade linear logistic
regression and boosting to MI data. However, we assume a different process
for combining the instance-level class probability estimates, namely that all
instances contribute equally to a bag’s class probability. Consider an instance-level
model that estimates $\Pr(y|x)$ or the log-odds function
$\log\frac{\Pr(y=1|x)}{\Pr(y=0|x)}$, and a bag $b$ that corresponds to a certain area
in the instance space. Given this bag $b$ with $n$ instances $x_i \in b$, we assume
that the bag-level class probability is either given by

$$\Pr(y|b) = \frac{1}{n}\sum_{i=1}^{n}\Pr(y|x_i) \qquad (1)$$

or by

$$\Pr(y=1|b) = \frac{\left[\prod_{i=1}^{n}\Pr(y=1|x_i)\right]^{1/n}}{\left[\prod_{i=1}^{n}\Pr(y=1|x_i)\right]^{1/n} + \left[\prod_{i=1}^{n}\Pr(y=0|x_i)\right]^{1/n}}$$

$$\Pr(y=0|b) = \frac{\left[\prod_{i=1}^{n}\Pr(y=0|x_i)\right]^{1/n}}{\left[\prod_{i=1}^{n}\Pr(y=1|x_i)\right]^{1/n} + \left[\prod_{i=1}^{n}\Pr(y=0|x_i)\right]^{1/n}} \qquad (2)$$

Fig. 1. An artificial dataset with 20 bags.

Equation 1 is based on the arithmetic mean of the instance-level class probability
estimates while Equation 2 involves the geometric mean.
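The two ways of combining instance-level probabilities can be compared numerically with a short Python sketch. The function names are illustrative, and the geometric variant assumes every instance probability lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def arithmetic_bag_prob(instance_probs):
    """Equation 1: arithmetic mean of the instance-level Pr(y=1|x_i)."""
    return sum(instance_probs) / len(instance_probs)

def geometric_bag_prob(instance_probs):
    """Equation 2: normalized geometric mean of the instance-level probabilities."""
    n = len(instance_probs)
    # Work in log space to avoid underflow for large bags.
    log_pos = sum(math.log(p) for p in instance_probs) / n
    log_neg = sum(math.log(1.0 - p) for p in instance_probs) / n
    pos, neg = math.exp(log_pos), math.exp(log_neg)
    return pos / (pos + neg)
```

For a bag with instance probabilities 0.99 and 0.5, the arithmetic mean gives 0.745 while the geometric form gives about 0.909, because the geometric form averages the instance log-odds and a single very confident instance therefore carries more influence.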

The above generative models are best illustrated based on an artificial
dataset. Consider an artificial domain with two attributes. We create bags of
instances by defining rectangular regions in this two-dimensional instance space
and sampling instances from within each region. First, we generate coordinates
for the centroids of the rectangles according to a uniform distribution with a
range of $[-5, 5]$ for each of the two dimensions. The size of a rectangle in each
dimension is chosen from 2 to 6 with equal probability. Each rectangle is used
to create a bag of instances. For sampling the instances, we assume a symmetric
triangle distribution with ranges $[-5, 5]$ in each dimension (and density function
$f(x) = 0.2 - 0.04|x|$). From this distribution we sample $n$ instances from within
a rectangle.

The number of instances $n$ for each bag is chosen from 1 to 20 with equal
probability. We define the instance-level class probability by a linear logistic
model $\Pr(y = 1|x_1, x_2)$, where $x_1$ and $x_2$ are the two attribute values
of an instance. Thus the log-odds function is a simple linear function, which
means the instance-level decision boundary is a line. Then we take Equation 2
to calculate $\Pr(y|b)$. Finally we label each bag by flipping a coin according to
this probability.
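The generation procedure can be sketched as follows. Several details are assumptions of this sketch rather than facts from the text: the logistic coefficients `beta = (1.0, 2.0)` with no intercept are hypothetical, the rectangle side lengths are taken to be integers from 2 to 6, and the triangle distribution is restricted to a rectangle by simple rejection sampling.

```python
import math
import random

def sample_in_interval(rng, low, high):
    """Rejection-sample the triangle density on [-5, 5], restricted to [low, high]."""
    while True:
        x = rng.triangular(-5.0, 5.0, 0.0)
        if low <= x <= high:
            return x

def make_bag(rng, beta=(1.0, 2.0)):
    """Generate one labeled bag; beta holds hypothetical logistic coefficients."""
    centroid = [rng.uniform(-5.0, 5.0) for _ in range(2)]
    size = [rng.randint(2, 6) for _ in range(2)]  # assumed integer side lengths
    bounds = [(c - s / 2.0, c + s / 2.0) for c, s in zip(centroid, size)]
    n = rng.randint(1, 20)
    instances = [tuple(sample_in_interval(rng, lo, hi) for lo, hi in bounds)
                 for _ in range(n)]
    # With a linear logistic model, Equation 2 reduces to the logistic
    # function of the mean instance log-odds.
    mean_log_odds = sum(beta[0] * x1 + beta[1] * x2 for x1, x2 in instances) / n
    prob = 1.0 / (1.0 + math.exp(-mean_log_odds))
    label = 1 if rng.random() < prob else 0  # the coin flip
    return {"bounds": bounds, "instances": instances, "prob": prob, "label": label}

def make_dataset(num_bags, seed=0):
    """Generate num_bags bags with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return [make_bag(rng) for _ in range(num_bags)]
```

Calling `make_dataset(20)` yields a dataset of the same size as the one described for Figure 1, although the bags themselves will differ.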

Figure 1 shows a dataset with 20 bags that was generated according to this
model. The black line in the middle is the instance-level decision boundary (i.e.
where $\Pr(y = 1|x_1, x_2) = 0.5$) and the sub-space on the right side has instances
with higher probability to be positive. A rectangle indicates the region used to
sample points for the corresponding bag (and a dot indicates its centroid). The
top-left corner of each rectangle shows the bag index, followed by the number of
instances in the bag. Bags in gray belong to class “negative” and bags in black to
class “positive”. Note that bags can be on the “wrong” side of the instance-level
decision boundary because each bag was labeled by flipping a coin based on its
class probability.

3 Upgrading Linear Logistic Regression and Boosting 

In this section we first show how to implement linear logistic regression based on
the above generative model. Then we do the same for additive logistic regression
(i.e. boosting), more specifically AdaBoost.M1 [3].



3.1 Linear Logistic Regression 

The standard logistic regression model does not apply to multi-instance data 
because the instances’ class labels are masked by the “collective” class label of 
a bag. However, suppose we know how the instance-level class probabilities are 
combined to form a bag-level probability, say, by Equation 2. Then we can apply 
a standard optimization method (e.g. gradient descent) to search for a maximum 
of the (bag-level) binomial likelihood based on the training data. This gives us 
an indirect estimate of the instance-level logistic model. 

The instance-level class probabilities are given by $\Pr(y = 1|x) = 1/(1 + \exp(-\beta x))$
and $\Pr(y = 0|x) = 1/(1 + \exp(\beta x))$ respectively, where $\beta$ is the
parameter vector to be estimated. If we assume bag-level probabilities are given
by Equation 2, this results in the bag-level model

$$\Pr(y=1|b) = \frac{\left[\prod_{i}^{n}\Pr(y=1|x_i)\right]^{1/n}}{\left[\prod_{i}^{n}\Pr(y=1|x_i)\right]^{1/n} + \left[\prod_{i}^{n}\Pr(y=0|x_i)\right]^{1/n}} = \frac{\exp(\frac{1}{n}\beta\sum_i x_i)}{1 + \exp(\frac{1}{n}\beta\sum_i x_i)}$$

$$\Pr(y=0|b) = \frac{\left[\prod_{i}^{n}\Pr(y=0|x_i)\right]^{1/n}}{\left[\prod_{i}^{n}\Pr(y=1|x_i)\right]^{1/n} + \left[\prod_{i}^{n}\Pr(y=0|x_i)\right]^{1/n}} = \frac{1}{1 + \exp(\frac{1}{n}\beta\sum_i x_i)} \qquad (3)$$

Based on this we can estimate the parameter vector $\beta$ by maximizing the
bag-level binomial log-likelihood function:

$$LL = \sum_{i=1}^{N}\left[y_i \log \Pr(y = 1|b) + (1 - y_i)\log \Pr(y = 0|b)\right] \qquad (4)$$

where $N$ is the number of bags.

Note that this problem can actually be reduced to standard single-instance
logistic regression by converting each bag of instances into a single instance
representing the bag’s mean (because the instances only enter Equation 3 in
the sum $\sum_i x_i$). The reason for this is the simple linear structure of the model
and the fact that the bag-level probabilities are generated based on the geometric
mean (Equation 2). In practice, it is usually not possible to say whether
the geometric mean is more appropriate than the arithmetic one, so we may
want to use Equation 1 as an alternative. In that case, we change the formulation
of $\Pr(y|b)$ based on Equation 1 and use this in conjunction with the same
log-likelihood function (Equation 4). This problem can no longer be solved by
applying the above transformation in conjunction with standard single-instance
logistic regression. In the remainder of this paper, we will call the former method
MILogisticRegressionGEOM and the latter one MILogisticRegressionARITH.

1. Initialize weight of each bag to $W_i = 1/N$, $i = 1, 2, \ldots, N$.

2. Repeat for $m = 1, 2, \ldots, M$:

a) Set $W_{ij} \leftarrow W_i/n_i$, assign the bag’s class label to each
of its instances, and build an instance-level model $h_m(x_{ij}) \in \{-1, 1\}$.

b) Within the $i$th bag (with $n_i$ instances), compute the error rate $e_i \in [0, 1]$
by counting the number of misclassified instances within that bag,
i.e. $e_i = \sum_j 1_{(h_m(x_{ij}) \neq y_i)}/n_i$.

c) If $e_i < 0.5$ for all $i$’s, go to Step 3.

d) Compute $c_m = \arg\min \sum_i W_i \exp[(2e_i - 1)c_m]$ using numeric optimization.

e) If ($c_m < 0$), go to Step 3.

f) Set $W_i \leftarrow W_i \exp[(2e_i - 1)c_m]$ and renormalize so that $\sum_i W_i = 1$.

3. return sign[$\sum_j \sum_m c_m h_m(x_j)$].

Fig. 2. The MIBoosting Algorithm.

As usual, the maximization of the log-likelihood function is carried out via 
numeric optimization because there is no direct analytical solution. The opti- 
mization problem can be solved very efficiently because we are working with a 
linear model. (Note that the radial formulation in the DD [2] algorithm, for ex- 
ample, poses a difficult global optimization problem that is expensive to solve.) In 
our implementation we use a quasi-Newton optimization procedure with BFGS 
updates suggested in [4], searching for parameter values around zero. 
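For the geometric-mean variant, the bag-mean reduction and the bag-level log-likelihood can be sketched as below. This is an illustration only: it assumes labels in {0, 1} and a model without an intercept (an intercept could be handled by appending a constant attribute of 1 to every instance), and it leaves out the optimizer and any regularization.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def bag_mean(bag):
    """Attribute-wise mean of the instances in a bag."""
    n = len(bag)
    return [sum(x[k] for x in bag) / n for k in range(len(bag[0]))]

def geom_bag_prob(beta, bag):
    """Pr(y=1|b) under the geometric-mean model: logistic of beta . mean(x)."""
    return sigmoid(sum(b * m for b, m in zip(beta, bag_mean(bag))))

def log_likelihood(beta, bags, labels):
    """Bag-level binomial log-likelihood, with labels in {0, 1}."""
    total = 0.0
    for bag, y in zip(bags, labels):
        p = geom_bag_prob(beta, bag)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Because `geom_bag_prob` depends on a bag only through its mean, any standard single-instance logistic regression routine applied to the bag means maximizes the same likelihood. The arithmetic-mean variant has no such shortcut and needs the instance probabilities of each bag.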

3.2 Boosting 

Both linear logistic regression and the model used by the DD algorithm assume
a limited family of patterns. Boosting is a popular algorithm for modeling more
complex relationships. It constructs an ensemble of so-called “weak” classifiers.
In the following we explain how to upgrade the AdaBoost.M1 algorithm to MI
problems, assuming the same generative model as in the case of linear logistic
regression. Note that AdaBoost.M1 can be directly applied to MI problems
(without any modifications) if the “weak” learner is a full-blown MI learner,
in the same way as ensemble classifiers for MI learning have been built using
bagging in [5]. In the following we consider the case where the weak learner is a
standard single-instance learner (e.g. C4.5 [6]).

Boosting, more specifically AdaBoost.M1, originates from work in computational
learning theory [3], but received a statistical explanation as additive
logistic regression in [7]. It can be shown that it minimizes an exponential loss
function in a forward stage-wise manner, ultimately estimating the log-odds function
$\frac{1}{2}\log\frac{\Pr(y=1|x)}{1-\Pr(y=1|x)}$ based on an additive model [7]. To upgrade AdaBoost.M1
we use the generative model described in Section 2 in conjunction with Equation 2
(i.e. the geometric average). The pseudo code for the algorithm, called
“MIBoosting”, is shown in Figure 2. Here, $N$ is the number of bags, and we
use the subscript $i$ to denote the $i$th bag, where $i = 1, 2, \ldots, N$. There are $n_i$
instances in the $i$th bag, and we use the subscript $j$ to refer to the $j$th instance,
where $j = 1, 2, \ldots, n_i$. The $j$th instance in the $i$th bag is $x_{ij}$. Note that, for
convenience, we assume that the class label of a bag is either 1 or -1 (i.e. $y \in \{1, -1\}$),
rather than 1 or 0.

The derivation of the algorithm is analogous to that in [7]. We regard the
expectation sign $E$ as the sample average instead of the population expectation.
We are looking for a function (i.e. classifier) $F(b)$ that minimizes the exponential
loss $E_B E_{Y|B}[\exp(-yF(b))]$. In each iteration of boosting, the aim is to expand
$F(b)$ into $F(b) + cf(b)$ (i.e. adding a new “weak” classifier) so that the exponential
loss is minimized. In the following we will just write $E[\cdot]$ to denote $E_B E_{Y|B}[\cdot]$,
and use $E_w[\cdot]$ to denote the weighted expectation, as in [7].

In each iteration of the algorithm, we search for the best $f(b)$ to add to the
model. Second order expansion of $\exp(-ycf(b))$ about $f(b) = 0$ shows that we
can achieve this by searching for the maximum of $E_w[yf(b)]$, given bag-level
weights $W_b = \exp(-yF(b))$ [7]. If we had an MI-capable weak learner at hand
that could deal with bag weights, we could estimate $f(b)$ directly. However we
are interested in wrapping our boosting algorithm around a single-instance weak
learner. Thus we expand $f(b)$ into $f(b) = \sum_j h(x_j)/n$, where $h(x_j) \in \{-1, 1\}$ is
the prediction of the weak classifier $h(\cdot)$ for the $j$th instance in $b$. We are seeking
a weak classifier $h(\cdot)$ that maximizes

$$E_w\Big[y \sum_j h(x_j)/n\Big] = \sum_{i=1}^{N}\sum_{j=1}^{n_i} \frac{W_i}{n_i}\, y_i h(x_{ij}).$$

This is maximized if $h(x_{ij}) = y_i$, which means that we are seeking a classifier
$h(\cdot)$ that attempts to correctly predict the label of the bag that each instance
pertains to. More precisely, $h(\cdot)$ should minimize the weighted instance-level
classification error given instance weights $W_i/n_i$ (assuming we give every instance its
bag’s label). This is exactly what standard single-instance learners are designed
to do (provided they can deal with instance weights). Hence we can use any
(weak) single-instance learner to generate the function $h(\cdot)$ by assigning each
instance the label of its bag and the corresponding weight $W_i/n_i$. This constitutes
Step 2a of the algorithm in Figure 2.

Now that we have found $f(b)$, we have to look for the best multiplier $c > 0$.
To do this we can directly optimize the objective function

$$E_B E_{Y|B}[\exp(-yF(b) + c(-yf(b)))] = \sum_i W_i \exp\Big[c_m\Big(-\frac{y_i \sum_j h(x_{ij})}{n_i}\Big)\Big] = \sum_i W_i \exp[(2e_i - 1)c_m],$$

where $e_i = \sum_j 1_{(h(x_{ij}) \neq y_i)}/n_i$ (computed in Step 2b).

Minimization of this expectation constitutes Step 2d. Note that this function
will not have a global minimum if all $e_i < 0.5$. In that case all bags will be
correctly classified by $f(b)$, and no further boosting iterations can be performed.
Therefore this is checked in Step 2c. This is analogous to what happens in
standard AdaBoost.M1 if the current weak learner has zero error on the training
data.

Note that the solution of the optimization problem involving $c$ may not
have an analytical form. Therefore we simply search for $c$ using a quasi-Newton
method in Step 2d. The computational cost for this is negligible compared to the
time it takes to learn the weak classifier. The resulting value for $c$ is not necessarily
positive. If it is negative we can simply reverse the prediction of the weak
classifier and get a positive $c$. However, we use the AdaBoost.M1 convention and
stop the learning process if $c$ becomes negative. This is checked in Step 2e.
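The one-dimensional search of Step 2d can be sketched as follows. Note the substitution: instead of the quasi-Newton method used by the authors, this sketch uses plain bisection on the derivative, which works because the objective is convex in c. The function name and the tolerance are illustrative choices.

```python
import math

def best_multiplier(weights, error_rates, tol=1e-10):
    """Minimize sum_i W_i * exp((2*e_i - 1) * c) over c by bisection."""
    slopes = [2.0 * e - 1.0 for e in error_rates]
    if not (any(a < 0 for a in slopes) and any(a > 0 for a in slopes)):
        raise ValueError("no finite minimizer: need e_i both below and above 0.5")

    def derivative(c):
        return sum(w * a * math.exp(a * c) for w, a in zip(weights, slopes))

    # The objective is convex, so its derivative is increasing in c.
    # Expand the bracket until it contains the root of the derivative.
    low, high = -1.0, 1.0
    while derivative(low) > 0:
        low *= 2.0
    while derivative(high) < 0:
        high *= 2.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if derivative(mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

With four equally weighted single-instance bags of which one is misclassified (error rates 0, 0, 0, 1), the search returns approximately 0.5 log 3, the value that the closed-form single-instance solution gives for a weighted error of 0.25.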

Finally, we update the bag-level weights in Step 2f according to the additive
structure of $F(b)$, in the same way as in standard AdaBoost.M1. Note that,
the more misclassified instances a bag has, the greater its weight in the next
iteration. This is analogous to what happens in standard AdaBoost.M1 at the
instance level.

To classify a test bag, we can simply regard $F(b)$ as the bag-level log-odds
function and take Equation 2 to make a prediction (Step 3). An appealing property
of this algorithm is that, if there is only one instance per bag, i.e. for
single-instance data, the algorithm naturally degenerates to normal AdaBoost.M1. To
see why, note that the solution for $c$ in Step 2d will be exactly
$\frac{1}{2}\log\frac{1 - err_W}{err_W}$, where $err_W$ is the weighted error. Hence the
weight update will be the same as in standard AdaBoost.M1.
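The single-instance case can be worked out explicitly. With one instance per bag, every bag error rate is either 0 or 1, so the Step 2d objective splits into a correctly classified part and a misclassified part. The derivation below assumes the weights are normalized to sum to one, as they are after initialization and after Step 2f.

```latex
\begin{align*}
J(c) &= \sum_i W_i \exp[(2e_i - 1)c]
      = (1 - err_W)\,e^{-c} + err_W\,e^{c},
      \qquad err_W = \sum_{i:\,e_i = 1} W_i, \\
\frac{dJ}{dc} &= -(1 - err_W)\,e^{-c} + err_W\,e^{c} = 0
      \;\Longrightarrow\; e^{2c} = \frac{1 - err_W}{err_W}
      \;\Longrightarrow\; c = \frac{1}{2}\log\frac{1 - err_W}{err_W}.
\end{align*}
```

The weight update of Step 2f then multiplies the weight of each misclassified example by $e^{c}$ and that of each correctly classified example by $e^{-c}$ before renormalization.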

4 Experimental Results 

In this section we first discuss empirical results for the artificial data from Sec- 
tion 2. Then we evaluate our techniques on the Musk benchmark datasets. 



4.1 Results for the Artificial Data 

Because we know the artificial data from Section 2 is generated using a linear 
logistic model based on the geometric mean formulation in Equation 2, MIL- 
ogisticRegressionGEOM is the natural candidate to test on this specific data. 
Figure 3 shows the parameters estimated by this method when the number of 
bags increases. As expected, the estimated parameters converge to the true pa- 
rameters asymptotically. However, more than 1000 bags are necessary to produce 
accurate estimates. 

To evaluate the performance of MIBoosting we created an independent test 
set with 10,000 bags based on the same generative model. As the “weak” learner 
we used decision stumps, performing 30 iterations of the boosting algorithm. 
The test error for different numbers of training bags is plotted in Figure 4, and 
compared to that of MI linear logistic regression. Not surprisingly, the latter out- 
performs boosting because it matches the underlying generative model exactly. 
However, both methods approach the optimum error rate on the test data.

Fig. 3. Parameters estimated by MI Logistic Regression on the artificial data.

Fig. 4. Test error rates for boosting and logistic regression on the artificial data.



4.2 Results for the Musk Problems 

In this section we present results for the Musk drug activity datasets [1]. The
Musk 1 data has 92 bags and 476 instances, the Musk 2 data 102 bags and 6598
instances. Both are two-class problems with 166 attributes. Because of the large
number of attributes, some regularization proved necessary to achieve good results
with logistic regression. To this end we added an $L_2$ penalty term $\lambda\|\beta\|^2$
to the likelihood function in Equation 4, where $\lambda$ is the ridge parameter.² In our
experiments we set $\lambda = 2$. For boosting, we used C4.5 trees as the “weak” classifiers
and 50 boosting iterations. We introduced some regularization by setting
the minimum number of instances per leaf to twice the average bag size (10 instances
for Musk 1, and 120 instances for Musk 2). The post-pruning mechanism
in C4.5 was turned off.

The upper part of Table 1 shows error estimates for the two variants of 
MILogisticRegression and for MIBoosting. These were obtained by 10 runs of 
stratified 10-fold cross-validation (CV) (at the bag level). The standard deviation 
of the 10 estimates from the 10 runs is also shown. The lower part summarizes 
some results for closely related statistical methods that can be found in the 
literature. A summary of results for other multi-instance learning algorithms 
can be found in [8]. Note also that the performance of MI learning algorithms 
can be improved further using bagging, as shown in [5]. 

The first result in the lower part of Table 1 is for the Diverse Density algorithm
applied in conjunction with the noisy-or model [2]. This result was also
obtained by 10-fold cross-validation. The second result was produced by a neural
network. The MI Neural Network is a multi-layer perceptron adapted to
multi-instance problems by using a soft-max function to hone in on the “key” instances
in a bag [9].³ The last entry in Table 1 refers to a support vector machine (SVM)
used in conjunction with a Gaussian multi-instance kernel [8]. This method replaces
the dot product between two instances (used in standard SVMs) by the
sum of the dot products over all possible pairs of instances from two bags. This
method resembles ours in the sense that it also implicitly assumes all instances
in a bag are equally relevant to the classification of the bag. The error estimates
for the SVM were generated using leave-10-out, which produces similar results
as 10-fold cross-validation on these datasets.

² We standardized the training data using the weighted mean and variance, each
instance being weighted by the inverse of the number of instances in its bag.

³ It is unclear how the error estimates for the neural network were generated.

Table 1. Error rate estimates for the Musk datasets and standard deviations (if available).

Method                          Musk 1        Musk 2
MILogisticRegressionGEOM        14.13±2.23    17.74±1.17
MILogisticRegressionARITH       13.26±1.83    15.88±1.29
MIBoosting with 50 iterations   12.07±1.95    15.98±1.31
Diverse Density                 11.1          17.5
MI Neural Network               12.0          18.0
SVM with Gaussian MI kernel     13.6±1.1      12.0±1.0

Fig. 5. The effect of varying the number of boosting iterations on the Musk 1 data.

The results for logistic regression indicate that the arithmetic mean (Equa- 
tion 1) is more appropriate than the geometric one (Equation 2) on the Musk 
data. Furthermore, boosting improves only slightly on logistic regression. For 
both methods the estimated error rates are slightly higher than for DD and the 
neural network on Musk 1, and slightly lower on Musk 2. Compared to the SVM, 
the results are similar for Musk 1, but the SVM appears to have an edge on the 
Musk 2 data. However, given that logistic regression finds a simple linear model, 
it is quite surprising that it is so competitive. 

Figure 5 shows the effect of different numbers of iterations in the boosting
algorithm. When applying boosting to single-instance data, it is often observed
that the classification error on the test data continues to drop after the error on
the training data has reached zero. Figure 5 shows that this effect also occurs in
the multi-instance version of boosting, in this case applied to the Musk 1 data
and using decision stumps as the weak classifiers.

Figure 5 also plots the Root Relative Squared Error (RRSE) of the probability
estimates generated by boosting. It shows that the classification error on the
training data is reduced to zero after about 100 iterations but the RRSE keeps
decreasing until around 800 iterations. The generalization error, estimated using
10 runs of 10-fold CV, also reaches a minimum around 800 iterations, at 10.44%
(standard deviation: 2.62%). Note that this error rate is lower than the one for
boosted decision trees (Table 1). However, boosting decision stumps is too slow
for the Musk 2 data, where even 8000 iterations were not sufficient to minimize
the RRSE.
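The text does not define RRSE. Under the usual definition, it is the square root of the summed squared error of the probability estimates divided by the summed squared error of a baseline that always predicts the mean of the actual values. The sketch below follows that definition and is an assumption about what was measured, not a description of the authors' exact computation.

```python
import math

def rrse(predicted, actual):
    """Root relative squared error of predictions against a mean-only baseline."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    baseline_error = sum((mean_actual - a) ** 2 for a in actual)
    return math.sqrt(squared_error / baseline_error)
```

A value of 0 means perfect probability estimates and a value of 1 means the estimates are no better than the constant baseline, which is why RRSE can keep decreasing after the training classification error has already reached zero.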

5 Conclusions 

We have introduced multi-instance versions of logistic regression and boosting, 
and shown that they produce results comparable to other statistical multi- 
instance techniques on the Musk benchmark datasets. Our multi-instance al- 
gorithms were derived by assuming that every instance in a bag contributes 
equally to the bag’s class label — a departure from the standard multi-instance 
assumption that relates the likelihood of a particular class label to the presence 
(or absence) of a certain “key” instance in the bag. 

Abstract. Supporting continuous mining queries on data streams requires
algorithms that (i) are fast, (ii) make light demands on memory resources,
and (iii) are easy to adapt to concept drift. We propose a novel boosting
ensemble method that achieves these objectives. The technique is based on 
a dynamic sample-weight assignment scheme that achieves the accuracy of 
traditional boosting without requiring multiple passes through the data. The 
technique assures faster learning and competitive accuracy using simpler 
base models. The scheme is then extended to handle concept drift via change 
detection. The change detection approach aims at significant data changes 
that could cause serious deterioration of the ensemble performance, and 
replaces the obsolete ensemble with one built from scratch. Experimental results 
confirm the advantages of our adaptive boosting scheme over previous approaches. 

Keywords: Stream data mining, adaptive boosting ensembles, change detection 



1 Introduction 

A substantial amount of recent work has focused on continuous mining of data streams [4,
10, 11, 15, 16]. Typical applications include network traffic monitoring, credit card fraud
detection and sensor network management systems. Challenges are posed by data ever 
increasing in amount and in speed, as well as the constantly evolving concepts underlying 
the data. Two fundamental issues have to be addressed by any continuous mining attempt. 

Performance Issue. Constrained by the requirement of on-line response and by lim- 
ited computation and memory resources, continuous data stream mining should conform 
to the following criteria: (1) Learning should be done very fast, preferably in one pass
of the data; (2) Algorithms should make very light demands on memory resources, for 
the storage of either the intermediate results or the final decision models. These fast and 
light requirements exclude high-cost algorithms, such as support vector machines; also 
decision trees with many nodes should preferably be replaced by those with fewer nodes 
as base decision models. 

Adaptation Issue. For traditional learning tasks, the data is stationary. That is, the 
underlying concept that maps the features to class labels is unchanging [17]. In the context
of data streams, however, the concept may drift due to gradual or sudden changes of the 
external environment, such as increases of network traffic or failures in sensors. In fact, 
mining changes is considered to be one of the core issues of data stream mining [5]. 

In this paper we focus on continuous learning tasks, and propose a novel Adaptive
Boosting Ensemble method to solve the above problems. In general, ensemble methods 
combine the predictions of multiple base models, each learned using a learning algorithm 
called the base learner [2]. In our method, we propose to use very simple base models, 
such as decision trees with a few nodes, to achieve fast and light learning. Since simple 
models are often weak predictive models by themselves, we exploit the boosting technique
to improve the ensemble performance. The traditional boosting is modified to handle
data streams, retaining the essential idea of dynamic sample-weight assignment yet 
eliminating the requirement of multiple passes through the data. This is then extended to 
handle concept drift via change detection. Change detection aims at significant changes 
that would cause serious deterioration of the ensemble performance. The awareness of 
changes makes it possible to build an active learning system that adapts to changes
promptly. 

Related Work. Ensemble methods are hardly the only approach used for continuous
learning. Domingos et al. [4] devised a novel decision tree algorithm, the Hoeffding tree,
that performs asymptotically the same as or better than its batched version. This was
extended to CVFDT in an attempt to handle concept drift [11]. However, Hoeffding-tree-like
algorithms need a large training set in order to reach a fair performance, which makes
them unsuitable for situations featuring frequent changes. Domeniconi et al. [3] designed
an incremental support vector machine algorithm for continuous learning. 

There has been work related to boosting ensembles on data streams. Fern et al. [6] 
proposed online boosting ensembles, and Oza et al. [12] studied both online bagging and 
online boosting. Frank et al. [7] used a boosting scheme similar to our boosting scheme. 
But none of these works took concept drift into consideration.

Previous ensemble methods for drifting data streams have primarily relied on 
bagging-style techniques [15,16]. Street et al. [15] gave an ensemble algorithm that 
builds one classifier per data block independently. Adaptability relies solely on retiring 
old classifiers one at a time. Wang et al. [16] used a similar ensemble building method.
But their algorithm tries to adapt to changes by assigning weights to classifiers pro- 
portional to their accuracy on the most recent data block. As these two algorithms are 
the most related, we call them Bagging and Weighted Bagging, respectively, for later 
references in our experimental comparison.*

This paper is organized as follows. Our adaptive boosting ensemble method is pre- 
sented in section 2, followed by a change detection technique in section 3. Section 4 
contains experimental design and evaluation results, and we conclude in section 5. 



2 Adaptive Boosting Ensembles 

We use the boosting ensemble method since this learning procedure provides a number 
of formal guarantees. Freund and Schapire proved a number of positive results about its 
generalization performance [13]. More importantly, Friedman et al. showed that boosting 
is particularly effective when the base models are simple [9]. This is most desirable for 
fast and light ensemble learning on stream data.

* The name “bagging” derives from their analogy to traditional bagging ensembles [1]. 

Algorithm 1 Adaptive boosting ensemble algorithm

Ensure: Maintaining a boosting ensemble $E_b$ with classifiers $\{C_1, \ldots, C_m\}$, $m \leq M$.
1: while (1) do
2:   Given a new block $B_j = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i \in \{0, 1\}$,
3:   Compute ensemble prediction for sample i: $E_b(x_i) = \mathrm{round}(\frac{1}{m} \sum_{k=1}^{m} C_k(x_i))$,
4:   Change Detection: $E_b \Leftarrow \emptyset$ if a change detected!
5:   if ($E_b \neq \emptyset$) then
6:     Compute error rate of $E_b$ on $B_j$: $e_j = \frac{1}{n} \sum_{i=1}^{n} 1[E_b(x_i) \neq y_i]$,
7:     Set new sample weight $w_i = (1 - e_j)/e_j$ if $E_b(x_i) \neq y_i$; $w_i = 1$ otherwise
8:   else
9:     set $w_i = 1$, for all i.
10:  end if
11:  Learn a new classifier $C_{m+1}$ from weighted block $B_j$ with weights $\{w_i\}$,
12:  Update $E_b$: add $C_{m+1}$, retire $C_1$ if m = M.
13: end while



In its original form, the boosting algorithm assumes a static training set. Earlier 
classifiers increase the weights of misclassified samples, so that the later classifiers will 
focus on them. A typical boosting ensemble usually contains hundreds of classifiers. 
However, this lengthy learning procedure does not apply to data streams, where we have 
limited storage but continuous incoming data. Past data cannot be kept for long before
making room for new data. In light of this, our boosting algorithm requires only two passes of
the data. At the same time, it is designed to retain the essential idea of boosting — the 
dynamic sample weights modification. 

Algorithm 1 is a summary of our boosting process. As data continuously flows 
in, it is broken into blocks of equal size. A block Bj is scanned twice. The first pass 
is to assign sample weights, in a way corresponding to AdaBoost.M1 [8]. That is, if
the ensemble error rate is $e_j$, the weight of a misclassified sample $x_i$ is adjusted to
be $w_i = (1 - e_j)/e_j$. The weight of a correctly classified sample is left unchanged.
The weights are normalized to be a valid distribution. In the second pass, a classifier is 
constructed from this weighted training block. 
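
The first pass described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, and the fallback to uniform weights when the error rate is exactly 0 or 1 is an assumption, since the text does not cover these edge cases.

```python
def assign_sample_weights(predictions, labels):
    """First pass over a block: AdaBoost.M1-style weights from ensemble errors."""
    n = len(labels)
    wrong = [p != y for p, y in zip(predictions, labels)]
    error_rate = sum(wrong) / n
    if 0.0 < error_rate < 1.0:
        # Misclassified samples get (1 - e_j) / e_j; the rest keep weight 1.
        boost = (1.0 - error_rate) / error_rate
        raw = [boost if w else 1.0 for w in wrong]
    else:
        # Assumption: with e_j equal to 0 or 1 the ratio is degenerate,
        # so fall back to uniform weights.
        raw = [1.0] * n
    total = sum(raw)
    # Normalize so the weights form a valid distribution.
    return [w / total for w in raw], error_rate
```

With one error in four samples, the error rate is 0.25, the misclassified sample gets raw weight 3, and after normalization it carries half of the total weight.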

The system keeps only the most recent classifiers, up to M. We use a traditional 
scheme to combine the predictions of these base models, that is, by averaging the proba- 
bility predictions and selecting the class with the highest probability. Algorithm 1 is for 
binary classification, but can easily be extended to multi-class problems. 
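
A bounded ensemble with averaged probability predictions might be sketched as follows. The class name is hypothetical, and base classifiers are assumed to be callables that return the probability of class 1 for an input; ties at 0.5 are resolved toward class 1, which is also an assumption.

```python
from collections import deque


class BoundedEnsemble:
    """Keeps at most `max_size` classifiers; the oldest one is retired first."""

    def __init__(self, max_size):
        # deque(maxlen=M) drops the oldest item automatically when full.
        self.members = deque(maxlen=max_size)

    def add(self, classifier):
        self.members.append(classifier)

    def reset(self):
        """Discard all members, e.g. after a change alarm."""
        self.members.clear()

    def predict_proba(self, x):
        """Average of the members' estimates of P(class 1 | x)."""
        if not self.members:
            raise ValueError("empty ensemble")
        return sum(c(x) for c in self.members) / len(self.members)

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0
```

Using a fixed-length deque keeps memory bounded by M without any explicit bookkeeping for retiring the oldest classifier.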

Adaptability. Note that there is a step called “Change Detection” (line 4) in
Algorithm 1. This is a distinguishing feature of our boosting ensemble, which guarantees
that the ensemble can adapt promptly to changes. Change detection is conducted at every 
block. The details of how to detect changes are presented in the next section. 

Our ensemble scheme achieves adaptability by actively detecting changes and dis- 
carding the old ensemble when an alarm of change is raised. No previous learning 
algorithm has used such a scheme. One argument is that old classifiers can be tuned to 
the new concept by assigning them different weights. Our hypothesis, which is borne 
out by experiment, is that obsolete classifiers have bad effects on overall ensemble
performance even if they are weighted down. Therefore, we propose to learn a new ensemble
from scratch when changes occur. Slow learning is not a concern here, as our base learner
is fast and light, and boosting ensures high accuracy. The main challenge is to detect 
changes with a low false alarm rate. 



3 Change Detection 



In this section we propose a technique for change detection based on the framework 
of statistical decision theory. The objective is to detect changes that cause significant 
deterioration in ensemble performance, while tolerating minor changes due to random 
noise. Here, we view ensemble performance $\theta$ as a random variable. If data is stationary
and fairly uniform, the ensemble performance fluctuations are caused only by random
noise, hence $\theta$ is normally assumed to follow a Gaussian distribution. When data changes,
yet most of the obsolete classifiers are kept, the overall ensemble performance will
undergo two types of decreases. In case of an abrupt change, the distribution of $\theta$ will
change from one Gaussian to another. This is shown in Figure 1(a). Another situation is
when the underlying concept has constant but small shifts. This will cause the ensemble 
performance to deteriorate gradually, as shown in Figure 1(b). Our goal is to detect both 
types of significant changes. 



Fig. 1. Two types of significant changes. (a) Type I: abrupt changes; (b) Type II: gradual
changes over a period of time. These are the changes we aim to detect.



Every change detection algorithm is a certain form of hypothesis test. To make a
decision whether or not a change has occurred is to choose between two competing
hypotheses: the null hypothesis $H_0$ or the alternative hypothesis $H_1$, corresponding to
a decision of no-change or change, respectively. Suppose the ensemble has an accuracy
$\theta_j$ on block j. If the conditional probability density function (pdf) of $\theta$ under the null
hypothesis, $p(\theta \mid H_0)$, and that under the alternative hypothesis, $p(\theta \mid H_1)$, are both known,
we can make a decision using a likelihood ratio test:



$$L(\theta_j) = \frac{p(\theta_j \mid H_1)}{p(\theta_j \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \tau \qquad (1)$$



The ratio is compared against a threshold $\tau$. $H_1$ is accepted if $L(\theta_j) > \tau$, and rejected
otherwise. $\tau$ is chosen so as to ensure an upper bound of false alarm rate.

Now consider how to detect a possible type I change. When the null hypothesis $H_0$
(no change) is true, the conditional pdf is assumed to be a Gaussian, given by

$$p(\theta \mid H_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right\} \qquad (2)$$

where the mean $\mu_0$ and the variance $\sigma_0^2$ can be easily estimated if we just remember
a sequence of the most recent $\theta$'s. But if the alternative hypothesis $H_1$ is true, it is not
possible to estimate $p(\theta \mid H_1)$ before sufficient information is collected. This means a
long delay before the change could be detected. In order to detect it in a timely fashion, we
perform a significance test that uses $H_0$ alone. A significance test assesses how well
the null hypothesis $H_0$ explains the observed $\theta$. Then the general likelihood ratio test in
Equation 1 is reduced to:

$$p(\theta_j \mid H_0) \underset{H_1}{\overset{H_0}{\gtrless}} \tau \qquad (3)$$

When the likelihood $p(\theta_j \mid H_0) > \tau$, the null hypothesis is accepted; otherwise it is
rejected. Significance tests are effective in capturing large, abrupt changes.

For type II changes, we perform a typical hypothesis test as follows. First, we split
the history sequence of $\theta$'s into two halves. A Gaussian pdf can be estimated from each
half, denoted as $G_0$ and $G_1$. Then a likelihood ratio test in Equation 1 is conducted.

So far we have described two techniques aiming at two types of changes. They 
are integrated into a two-stage method as follows. As a first step, a significance test is
performed. If no change is detected, then a hypothesis test is performed as a second step. 
This two-stage detection method is shown to be very effective experimentally. 
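
The two-stage procedure can be illustrated with a short Python sketch. Several details are assumptions rather than statements from the paper: the function names, the two separate thresholds, the floor on the standard deviation, and the choice to evaluate the stage-two likelihood ratio at the newest accuracy value. Log densities are used to avoid numerical underflow.

```python
import math
from statistics import mean, pstdev


def gaussian_logpdf(x, mu, sigma):
    """Log of the Gaussian density with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def fit_gaussian(values, min_sigma=1e-3):
    # min_sigma guards against zero variance on flat histories (assumption).
    return mean(values), max(pstdev(values), min_sigma)


def detect_change(history, theta_j, tau_sig, tau_ratio):
    """Return 'abrupt', 'gradual', or None for the newest accuracy theta_j.

    history: recent accuracies (at least two values), oldest first.
    """
    # Stage 1: significance test using the null hypothesis alone.
    mu0, sigma0 = fit_gaussian(history)
    if gaussian_logpdf(theta_j, mu0, sigma0) < math.log(tau_sig):
        return "abrupt"
    # Stage 2: likelihood ratio between Gaussians fitted to the two halves.
    half = len(history) // 2
    g0 = fit_gaussian(history[:half])
    g1 = fit_gaussian(history[half:])
    log_ratio = gaussian_logpdf(theta_j, *g1) - gaussian_logpdf(theta_j, *g0)
    return "gradual" if log_ratio > math.log(tau_ratio) else None
```

The threshold values passed in are purely illustrative; in the method described above, the threshold is chosen to bound the false alarm rate.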

4 Experimental Evaluation 

In this section, we first perform a controlled study on a synthetic data set, then apply the 
method to a real-life application. 

In the synthetic data set, a sample x is a vector of three independent features $\langle x_i \rangle$,
$x_i \in [0, 1]$, i = 0, 1, 2. Geometrically, samples are points in a 3-dimensional unit cube.
The class boundary is a sphere defined as $B(x) = \sum_{i=0}^{2} (x_i - c_i)^2 - r^2 = 0$, where
c is the center of the sphere and r the radius. x is labelled class 1 if $B(x) < 0$, class 0
otherwise. This learning task is not easy, because the feature space is continuous and the 
class boundary is non-linear. 
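
For illustration, the synthetic concept could be generated as in the sketch below. The function names are hypothetical, and the boundary function is assumed to be the standard sphere equation (squared distance to the center minus the squared radius), so that points inside the sphere receive class 1.

```python
import random


def boundary(x, center, radius):
    """Assumed form: sum_i (x_i - c_i)^2 - r^2, negative inside the sphere."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - radius ** 2


def label(x, center, radius):
    return 1 if boundary(x, center, radius) < 0 else 0


def make_block(n, center, radius, rng):
    """Draw n points uniformly from the unit cube and label them."""
    points = [tuple(rng.random() for _ in range(3)) for _ in range(n)]
    return [(x, label(x, center, radius)) for x in points]
```

Passing an explicit `random.Random` instance keeps the generated blocks reproducible across runs.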

We evaluate our boosting scheme extended with change detection, named Adaptive
Boosting, and compare it with Weighted Bagging and Bagging.

In the following experiments, we use decision trees as our base model, but the 
boosting technique can, in principle, be used with any other traditional learning model. 
The standard C4.5 algorithm is modified to generate small decision trees as base models, 
with the number of terminal nodes ranging from 2 to 32. Full-grown decision trees 
generated by C4.5 are also used for comparison, marked as fullsize in Figures 2-4 and
Tables 1-2.





Fig. 2. Performance comparison of the adaptive boosting vs the bagging on stationary data. The
weighted bagging is omitted as it performs almost the same as the bagging. (x-axis: number of
decision tree terminal nodes: 2, 4, 8, 16, 32, fullsize.)

4.1 Evaluation of Boosting Scheme 

The boosting scheme is first compared against two bagging ensembles on stationary data. 
Samples are randomly generated in the unit cube. Noise is introduced in the training data 
by randomly flipping the class labels with a probability of p. Each data block has n sam- 
ples and there are 100 blocks in total. The testing data set contains 50k noiseless samples 
uniformly distributed in the unit cube. An ensemble of M classifiers is maintained. It is 
updated after each block and evaluated on the test data set. Performance is measured 
using the generalization accuracy averaged over 100 ensembles. 
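
The label noise described above amounts to flipping each binary label independently with probability p. A minimal sketch, with a hypothetical function name and blocks represented as lists of (features, label) pairs:

```python
import random


def flip_labels(block, p, rng):
    """Return a copy of `block` with each binary label flipped with probability p."""
    return [(x, 1 - y) if rng.random() < p else (x, y) for x, y in block]
```

Only the training blocks would be passed through this function; the test set in the experiment above is noiseless.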

Figure 2 shows the generalization performance when p=5%, n=2k and M=30. 
Weighted bagging is omitted from the figure because it makes almost the same
predictions as bagging, an unsurprising result for stationary data. Figure 2 shows that the
boosting scheme clearly outperforms bagging. Most importantly, boosting ensembles
with very simple trees perform well. In fact, the boosted two-level trees (2 terminal
nodes) have a performance comparable to bagging using the full size trees. This
supports the theoretical study that boosting improves weak learners.

Higher accuracy of boosted weak learners is also observed for (1) block size n of 
500, 1k, 2k and 4k, (2) ensemble size M of 10, 20, 30, 40, 50, and (3) noise level of 5%,
10% and 20%. 



4.2 Learning with Gradual Shifts 

Gradual concept shifts are introduced by moving the center of the class boundary between 
adjacent blocks. The movement is along each dimension with a step of $\pm\delta$. The value
of $\delta$ controls the level of shifts from small to moderate, and the sign of $\delta$ is randomly
assigned. The percentage of positive samples in these blocks ranges from 16% to 25%.
Noise level p is set to be 5%, 10% and 20% across multiple runs. 
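
A gradual shift of this kind can be sketched as a small update of the sphere center between blocks. The function name is hypothetical, and drawing the sign independently for each dimension at every block is an assumption, since the text only says that the sign is randomly assigned.

```python
import random


def shift_center(center, delta, rng):
    """Move each coordinate of the sphere center by +delta or -delta."""
    return tuple(c + rng.choice((-delta, delta)) for c in center)
```

Calling this once per block with a small step such as 0.01 would produce a slowly drifting class boundary.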

The average accuracies are shown in Figure 3 for small shifts ($\delta = 0.01$), and in
Figure 4 for moderate shifts ($\delta = 0.03$). Results of other settings are shown in Table 1.
These experiments are conducted where the block size is 2k. Similar results are obtained 
for other block sizes. The results are summarized below: 

Fig. 3. Performance comparison of the three ensembles on data with small gradual concept shifts.

Fig. 4. Performance comparison of the ensembles on data with moderate gradual concept shifts.



Table 1. Performance comparison of the ensembles on data with varying levels of concept shifts.
Top accuracies shown in bold fonts.

                     $\delta$ = .005                      $\delta$ = .02
                     2      4      8      fullsize    2      4      8      fullsize
Adaptive Boosting    89.2%  93.2%  93.9%  94.9%       92.2%  94.5%  95.7%  95.8%
Weighted Bagging     71.8%  84.2%  89.6%  91.8%       83.7%  92.0%  93.2%  94.2%
Bagging              71.8%  84.4%  90.0%  92.5%       83.7%  91.4%  92.4%  90.7%



- Adaptive boosting outperforms the two bagging methods at all times, demonstrating the
benefits of the change detection technique; and

- Boosting is especially effective with simple trees (terminal nodes < 8), achieving a
performance comparable to, or even better than, the bagging ensembles with large
trees.



4.3 Learning with Abrupt Shifts 

We study learning with abrupt shifts with two sets of experiments. Abrupt concept shifts
are introduced every 40 blocks; three abrupt shifts occur at block 40, 80 and 120. In one
set of experiments, data stays stationary between these blocks. In the other set, small
shifts are mixed between adjacent blocks. The concept drift parameters are set to be
$\delta_1 = \pm 0.1$ for abrupt shifts, and $\delta_2 = \pm 0.01$ for small shifts.

Table 2. Performance comparison of three ensembles on data with abrupt shifts or mixed shifts.
Top accuracies are shown in bold fonts.

$\delta_1 = \pm 0.1$  $\delta_2 = 0.00$       $\delta_2 = \pm 0.01$
                     4      fullsize    4      fullsize
Adaptive Boosting    93.2%  95.1%       93.1%  94.1%
Weighted Bagging     86.3%  92.5%       86.6%  91.3%
Bagging              86.3%  92.7%       85.0%  88.1%

Fig. 5. Performance comparison of the three ensembles on data with abrupt shifts. Base decision
trees have no more than 8 terminal nodes. (x-axis: data blocks.)

Figure 5 and Figure 6 show the experiments when base decision trees have no more 
than 8 terminal nodes. Clearly the bagging ensembles, even with an empirical weighting
scheme, are seriously impaired at changing points. Our hypothesis, that obsolete
classifiers are detrimental to overall performance even if they are weighted down, is
borne out experimentally. The adaptive boosting ensemble, on the other hand, is able to
respond promptly to abrupt changes by explicit change detection efforts. For base models
of different sizes, we show some of the results in Table 2. The accuracy is averaged over 
160 blocks for each run. 



Fig. 6. Performance comparison of the three ensembles on data with both abrupt and small shifts.
Base decision trees have no more than 8 terminal nodes. (x-axis: data blocks.)

4.4 Experiments on Real Life Data

In this subsection we further verify our algorithm on a real-life data set containing 100k
credit card transactions. The data has 20 features including the transaction amount, the
time of the transaction, etc. The task is to predict fraudulent transactions. Detailed data
description is given in [14]. The part of the data we use contains 100k transactions, each
with a transaction amount between $0 and $21. Concept drift is simulated by sorting
transactions by the transaction amount.

Fig. 7. Performance comparison of the three ensembles on credit card data. Concept shifts are
simulated by sorting the transactions by the transaction amount. (x-axis: data blocks.)



We study the ensemble performance using varying block sizes (1k, 2k, 3k and 4k),
and different base models (decision trees with terminal nodes no more than 2, 4, 8 and
full-size trees). We show one experiment in Figure 7, where the block size is 1k, and
the base models have at most 8 terminal nodes. The curve shows three dramatic drops 
in accuracy for bagging, two for weighted bagging, but only a small one for adaptive 
boosting. These drops occur when the transaction amount jumps. Overall, the boosting 
ensemble is much better than the two bagging methods. This is also true for the other
experiments, whose details are omitted here due to space limitations.

The boosting scheme is also the fastest. Moreover, the training time is almost
unaffected by the size of base models. This is due to the fact that the later base models tend
to have very simple structures; many of them are just decision stumps (one-level decision
trees). On the other hand, training time of the bagging methods increases dramatically
as the base decision trees grow larger. For example, when the base decision tree is
full-grown, the weighted bagging takes 5 times longer to do the training and produces a
tree 7 times larger on average. The comparison is conducted on a 2.26 GHz Pentium 4
processor. Details are shown in Figure 8.

To summarize, the real application experiment confirms the advantages of our boosting
ensemble method: it is fast and light, with good adaptability.




Fig. 8. Comparison of the adaptive boosting and the weighted bagging, in terms of (a) building 
time, and (b) average decision tree size. In (a), the total amount of data is fixed for different block 
sizes. 



5 Summary and Future Work 

In this paper, we propose an adaptive boosting ensemble method that is different from 
previous work in two aspects: (1) We boost very simple base models to build effective 
ensembles with competitive accuracy; and (2) We propose a change detection technique 
to actively adapt to changes in the underlying concept. We compare adaptive boosting 
ensemble methods with two bagging ensemble-based methods through extensive
experiments. Results on both synthetic and real-life data sets show that our method is much
faster, demands less memory, and is more adaptive and accurate.

The current method can be improved in several aspects. For example, our study of 
the trend of the underlying concept is limited to the detection of significant changes. If 
changes can be detected on a finer scale, new classifiers need not be built when changes 
are trivial, thus training time can be further saved without loss of accuracy. We also plan
to study a classifier weighting scheme to improve ensemble accuracy. 

Abstract. Generic ensemble methods can achieve excellent learning
performance, but are not good candidates for active learning because 
of their different design purposes. We investigate how to use diversity 
of the member classifiers of an ensemble for efficient active learning. We 
empirically show, using benchmark data sets, that (1) to achieve a good 
(stable) ensemble, the number of classifiers needed in the ensemble varies 
for different data sets; (2) feature selection can be applied for classifier 
selection from ensembles to construct compact ensembles with high per- 
formance. Benchmark data sets and a real-world application are used to 
demonstrate the effectiveness of the proposed approach. 



1 Introduction 

Active learning is a framework in which the learner has the freedom to select 
which data points are added to its training set [11]. An active learner may be- 
gin with a small number of labeled instances, carefully select a few additional 
instances for which it requests labels, learn from the result of those requests, 
and then using its newly-gained knowledge, carefully choose which instances to 
request next. More often than not, data in the form of text (including emails),
images, and multimedia are unlabeled, yet many supervised learning tasks need to
be performed [2,10] in real-world applications. Active learning can significantly
decrease the number of required labeled instances, thus greatly reducing expert
involvement. Ensemble methods are learning algorithms that construct a set of
classifiers and then classify new instances by taking a weighted or unweighted 
vote of their predictions. An ensemble often has smaller expected loss or error 
rate than any of the n individual (member) classifiers. A good ensemble is one 
whose members are both accurate and diverse [4]. This work explores the rela- 
tionship between the two learning frameworks, attempts to take advantage of 
the learning performance of ensemble methods for active learning in a real-world 
application, and studies how to construct ensembles for effective active learning. 

2 Our Approach 

2.1 Ensembles and Active Learning 

Active learning can be very useful where there are limited resources for label- 
ing data, and obtaining these labels is time-consuming or difficult [11]. There 
exist widely used active learning methods. Some examples are: Uncertainty
sampling [7] selects the instance on which the current learner has lowest certainty;
Pool-based sampling [9] selects the best instances from the entire pool of unla- 
beled instances; and Query-by-Committee [6,12] selects instances that have high 
classification variance themselves. 

Constructing good ensembles of classifiers has been one of the most active 
areas of research in supervised learning [4]. The main discovery is that ensembles
are often much more accurate than the member classifiers that make them up. 
A necessary and sufficient condition for an ensemble to be more accurate than 
any of its members is that the member classifiers are accurate and diverse. Two 
classifiers are diverse if they make different (or uncorrelated) errors on new data 
points. Many methods for constructing ensembles have been developed such as 
Bagging [3] and Boosting [5]. We consider Bagging in this work as it is the most 
straightforward way of manipulating the training data to form ensembles [4]. 

Disagreement or diversity of classifiers is used for different purposes for the
two learning frameworks: in generic ensemble learning, diversity of classifiers is 
used to ensure high accuracy by voting; in active learning, disagreement of classi- 
fiers is used to identify critical instances for labeling. In order for active learning 
to work effectively, we need a small number of highly accurate classifiers so that 
they seldom disagree with each other. Since ensemble methods have shown their 
robustness in producing highly accurate classifiers, we have investigated the use 
of class-specific ensembles (dual ensembles), and shown their effectiveness in our
previous work [8]. Next, we empirically investigate whether it is necessary to 
find compact dual ensembles and then we present a method to find them while 
maintaining good performance. 

2.2 Observations from Experiments on Benchmark Data Sets 

An ensemble’s goodness can be measured by accuracy and diversity. Let $\hat{Y}(x) =
\{\hat{y}_1(x), \ldots, \hat{y}_n(x)\}$ be the set of the predictions made by member classifiers
$C_1, \ldots, C_n$ of ensemble E on instance (x, y), where x is the input and y is the true class.
The ensemble prediction of a uniform voting ensemble for input x under loss function l
is $\hat{y}(x) = \arg\min_{y \in Y} E_{c \in C}[l(\hat{y}_c(x), y)]$. The loss of an ensemble on instance
(x, y) under loss function l is given by $L((x, y)) = l(\hat{y}(x), y)$. The diversity of an
ensemble on input x under loss function l is given by $D = E_{c \in C}[l(\hat{y}_c(x), \hat{y}(x))]$.
The error rate for a data set with N instances can be calculated as $e = \frac{1}{N} \sum_{i=1}^{N} L_i$,
where $L_i$ is the loss for instance $x_i$. Accuracy of ensemble E is 1 - e. Diversity
is the expected loss incurred by the predictions of the member classifiers relative 
to the ensemble prediction. Commonly used loss functions include square loss, 
absolute loss, and zero-one loss. We use zero-one loss in this work. 
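
Under zero-one loss these quantities reduce to simple counts, as the following Python sketch suggests. The function names are hypothetical, and averaging the per-instance diversity over the data set is an assumption about how a single diversity value per ensemble is obtained. Ties in the vote are broken by first occurrence.

```python
from collections import Counter


def vote(member_predictions):
    """Uniform-vote prediction: the most common label minimizes mean zero-one loss."""
    return Counter(member_predictions).most_common(1)[0][0]


def diversity(member_predictions):
    """Fraction of members that disagree with the ensemble prediction."""
    y_hat = vote(member_predictions)
    return sum(p != y_hat for p in member_predictions) / len(member_predictions)


def error_and_diversity(prediction_rows, labels):
    """Average zero-one loss and average diversity over a data set.

    prediction_rows[i] holds every member's prediction for instance i.
    """
    n = len(labels)
    error = sum(vote(row) != y for row, y in zip(prediction_rows, labels)) / n
    avg_diversity = sum(diversity(row) for row in prediction_rows) / n
    return error, avg_diversity
```

For three members and three instances with predictions `[1, 1, 0]`, `[0, 0, 0]`, `[1, 0, 0]` and true labels `1, 0, 1`, the votes are 1, 0, 0, so one of three instances is misclassified.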

The purpose of these experiments is to observe how diversity and error rate 
change as ensemble size increases. We use benchmark data sets from the UCI 
repository [1] in these experiments. We use Weka [13] implementation of Bag- 
ging [3] as the ensemble generation method and J4.8 (without pruning) as the 
base learning algorithm. For each data set, we run Bagging with increasing en- 
semble sizes from 5 to 151 and record each ensemble’s error rate e and diversity 
D. We run 10-fold cross validation and calculate the average values, $\bar{e}$ and $\bar{D}$.
We observed that as the ensemble sizes increase, diversity values increase and
approach the maximum, and error rates decrease and become stable. The
results show that smaller ensembles (with 30-70 classifiers) can achieve accuracy 
and diversity values similar to those of larger ensembles. We will now show a 
procedure for selecting compact dual ensembles from larger ensembles. 

2.3 Selecting Compact Dual Ensembles via Feature Selection 

The experiments with the benchmark data sets show that there exist smaller
ensembles that can have accuracy and diversity similar to those of large ensembles.
We need to select classifiers with these two criteria. We build our initial ensemble,
$E_{max}$, by setting max = 100 member classifiers. We now need to efficiently find a
compact ensemble $E_M$ (with M classifiers) that can have error rate and
diversity similar to those of $E_{max}$. We use all the learned classifiers ($C_k$) to generate
predictions for instances $(x_i, y_i)$: $\hat{y}_i^k = C_k(x_i)$. The resulting dataset consists of
instances of the form $((\hat{y}_i^1, \ldots, \hat{y}_i^K), y_i)$. After this data set is constructed, the problem
of selecting member classifiers becomes one of feature selection. Here features
actually represent member classifiers, therefore we also need to consider this
special nature for the feature selection algorithm.
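
Building the data set on which feature selection operates is a direct transformation, sketched below. The function name is hypothetical, and classifiers are assumed to be callables that map an input to a predicted label.

```python
def build_prediction_dataset(classifiers, instances):
    """One row per instance: ((C_1(x), ..., C_K(x)), y).

    Each classifier's output becomes one feature of the new data set.
    """
    return [(tuple(c(x) for c in classifiers), y) for x, y in instances]
```

Selecting a subset of columns of this data set then corresponds to selecting a subset of member classifiers.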



DualE: selecting compact dual ensembles

input: Tr: Training data, FSet: All classifiers in $E_{max}$, N: max;
output: $E_1$: Optimal ensemble for class=1, $E_0$: Optimal ensemble for class=0;
01 Generate N classifiers from Tr with Bagging;
02 $Tr_1 \leftarrow$ Instances(Tr) with class labels 1;
03 $Tr_0 \leftarrow$ Instances(Tr) with class labels 0;
04 Calculate diversity, $D_0$ and error rate, $e_0$ for $E_{max}$ on $Tr_1$;
05 $U \leftarrow N$; $L \leftarrow 0$; $M \leftarrow (U + L)/2$;
06 while $|U - M| > 1$
07   Pick M classifiers from FSet to form $E'$;
08   Calculate diversity, $D'$ and error rate, $e'$ for $E'$ on $Tr_1$;
09   if ($\frac{D_0 - D'}{D_0} < 1\%$) and ($\frac{e' - e_0}{e_0} < 1\%$)
10     $U \leftarrow M$; $M \leftarrow (L + M)/2$;
11   else
12     $L \leftarrow M$; $M \leftarrow (M + U)/2$;
13 $E_1 \leftarrow E'$;
14 Repeat steps 5 to 12 for $Tr_0$; $E_0 \leftarrow E'$;



Fig. 1. Algorithm for Selecting Dual Ensembles 

We design an algorithm, DualE, that takes O(log max) steps to determine M, where 
max is the size of the starting ensemble (e.g., 100). In words, we test an ensemble 
EM of size M, which lies between the upper and lower bounds U and L (initialized 
to max and 0, respectively). If EM's performance is similar to that of Emax, we 
set U = M and M = (L+M)/2; otherwise, we set L = M and M = (M+U)/2. The 
details are given in Fig. 1. Ensemble performance is defined by error rate e and 
diversity D. The diversity values of the two ensembles are similar if 
$(D_0 - D')/D_0 < p$, where p is user defined (0 < p < 1) and $D_0$ is the reference 
ensemble's (Emax's) diversity. Similarly, the error rates of the two ensembles are 
similar if $(e' - e_0)/e_0 < p$, where $e_0$ is the reference ensemble's error rate. 
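
The bisection over the ensemble size can be sketched as follows. This is only a sketch under stated assumptions: `evaluate` (returning a diversity value and an error rate) and `pick` (choosing M classifiers) are hypothetical callbacks supplied by the caller, integer division is used for the midpoints, and the function returns the last candidate that passed the similarity test rather than simply the last candidate tested.

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # any classifier representation


def relative_loss_ok(ref: float, loss: float, p: float) -> bool:
    """True if the loss relative to the reference value is below p."""
    return loss <= 0 if ref == 0 else loss / ref < p


def dual_e_select(
    classifiers: List[C],
    evaluate: Callable[[List[C]], Tuple[float, float]],  # -> (diversity, error)
    pick: Callable[[List[C], int], List[C]],              # choose m classifiers
    p: float = 0.01,
) -> List[C]:
    """Bisection search for a compact ensemble similar to the full one."""
    d_ref, e_ref = evaluate(classifiers)
    upper, lower = len(classifiers), 0
    m = (upper + lower) // 2
    best = list(classifiers)
    while abs(upper - m) > 1:
        candidate = pick(classifiers, m)
        d, e = evaluate(candidate)
        similar = (relative_loss_ok(d_ref, d_ref - d, p)
                   and relative_loss_ok(e_ref, e - e_ref, p))
        if similar:
            best = candidate
            upper, m = m, (lower + m) // 2   # try a smaller ensemble
        else:
            lower, m = m, (m + upper) // 2   # need a larger ensemble
    return best


# Toy setup (made up): ensembles of 13 or more members behave like the full one.
def toy_evaluate(subset):
    return (1.0 if len(subset) >= 13 else 0.5, 0.1)


def first_m(pool, m):
    return list(pool[:m])


compact = dual_e_select(list(range(100)), toy_evaluate, first_m)
```

In this toy run the search tests sizes 50, 25, 12, 18, 15 and 13, so only a handful of candidate ensembles are evaluated for a starting size of 100.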



3 Experiments 



Table 1. Dual Es vs. Emax for Image data 



Two sets of experiments are conducted with DualE: one on an image data 
set and the other on a benchmark data set (breast). The purpose is to examine 
whether the compact dual ensembles selected by DualE can work as well as the 
entire ensemble, Emax. When dual ensembles are used, it is possible that they 
disagree. The instances on which they disagree are called uncertain instances 
(UC). In the context of active learning, the uncertain instances will be labeled 
by an expert. The prediction of Emax is by majority vote, so there is no 
disagreement; for Emax only the accuracy is reported. 
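
How disagreement yields uncertain instances can be sketched as below. The sketch assumes that each of the two ensembles predicts by majority vote and that an instance is flagged as uncertain (returned as `None`) whenever the two votes differ; the function names and the toy threshold classifiers are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional

Classifier = Callable[[float], int]


def majority_vote(ensemble: List[Classifier], x: float) -> int:
    """Most frequent label predicted by the ensemble members."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


def dual_predict(e1: List[Classifier], e0: List[Classifier], x: float) -> Optional[int]:
    """Return the agreed label, or None if the dual ensembles disagree."""
    label1 = majority_vote(e1, x)
    label0 = majority_vote(e0, x)
    return label1 if label1 == label0 else None


# Toy dual ensembles (odd sizes avoid ties in the vote).
e1 = [lambda x, t=t: int(x > t) for t in (0.3, 0.4, 0.5)]
e0 = [lambda x, t=t: int(x > t) for t in (0.5, 0.6, 0.7)]
labels = [dual_predict(e1, e0, x) for x in (0.1, 0.45, 0.9)]
uncertain_count = sum(label is None for label in labels)
```

Only the instances returned as `None` would be passed to the expert for labeling.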

The image mining problem that we study here is to classify Egeria Densa in 
aerial images. To automate Egeria classification, we ask experts to label images, 
but want to minimize the task. Active learning is employed to reduce this expert 
involvement. The idea is to let experts label some instances, learn from these 
labeled instances, and then apply the active learner to unseen images. We have 
17 images with 5329 instances, represented by 13 attributes of color, texture and 
edge. One image is used for training and the rest for testing. We first train an 
initial ensemble Emax with max = 100 on the training image, then obtain the 
accuracy of Emax for the 17 testing images. Dual Es are the ensembles selected 
by using DualE. Table 1 clearly shows that the accuracy of Es is similar to that 
of Emax. The number of uncertain regions is also relatively small. This 
demonstrates the effectiveness of using the dual ensembles, Es, to reduce expert 
involvement in labeling. 

For the breast data set, we design a new 3-fold cross validation scheme, which 
uses 1 fold for training and the remaining 2 folds for testing. This is repeated for 
all the 3 folds of the training data. The results for the breast data set are shown 
in Table 2. We also randomly select member classifiers to form random dual 
ensembles, Dual Er. We do so 10 times and report the average accuracy and 
number of uncertain instances. In Table 2, Dual Es are the selected ensembles 
(using DualE), and Emax is the initial ensemble. Accuracy gains for Dual Er 
and Emax (and UC Incr for Dual Er) against Es are reported. Comparing dual 



Image     Dual Es           Emax 
          Acc%     #UC      Acc%     Acc Gain% 
1         81.91    1        81.90    -0.0122 
2         90.00    0        90.00     0.0000 
3         78.28    38       79.28     1.2775 
4         87.09    34       86.47    -0.7119 
5         79.41    0        79.73     0.4029 
6         84.51    88       84.77     0.3076 
7         85.00    3        85.41     0.4823 
8         85.95    18       86.6      0.7562 
9         71.46    0        72.32     1.2035 
10        91.08    2        90.8     -0.3074 
11        89.15    31       88.82    -0.3702 
12        75.91    0        76.02     0.1449 
13        66.84    0        67.38     0.8079 
14        73.06    49       73.73     0.9170 
15        83.1     1        83.24     0.1684 
16        76.57    14       76.82     0.3265 
17        87.67    31       88.42     0.8555 
Average   81.58    18.24    81.86     0.3676 

Table 2. Comparing Es with Er and Emax for Breast data 

          Dual Es          Dual Er                                   Emax 
          Acc%     #UC     Acc%     #UC    Acc Gain%   UC Incr%      Acc%     Acc Gain% 
Fold 1    95.9227  3       94.0773  13.6   -1.9238     353.33        96.1373   0.2237 
Fold 2    97.2103  5       94.4206  15.2   -2.8698     204.00        96.9957  -0.2208 
Fold 3    94.8498  12      93.5193  8.7    -1.4027     -27.50        94.4206  -0.4525 
Average   95.9943  6.67    94.0057  12.5   -2.0655     176.61        95.8512  -0.1498 



Es and dual Er, dual Er exhibits lower accuracy and more uncertain instances. 
Comparing dual Es and Emax, we observe no significant change in accuracy. 
This is consistent with what we want (maintaining both accuracy and diversity). 

4 Conclusions 

In this work, we point out that (1) generic ensemble methods are not suitable 
for active learning, and (2) dual ensembles are very good for active learning if we 
can build compact dual ensembles. Our empirical study suggests that such 
compact ensembles exist. We propose DualE, which can find compact ensembles 
with good performance via feature selection. Experiments on a benchmark and 
an image data set exhibit the effectiveness of dual ensembles for active learning. 
Abstract. In this paper, the impact of the size of the training set on 
the benefit from ensemble, i.e. the gains obtained by employing ensemble 
learning paradigms, is empirically studied. Experiments on Bagged/Boosted 
J4.8 decision trees with/without pruning show that enlarging 
the training set tends to improve the benefit from Boosting but does not 
significantly impact the benefit from Bagging. This phenomenon is then 
explained from the view of bias-variance reduction. Moreover, it is shown 
that even for Boosting, the benefit does not always increase consistently 
along with the increase of the training set size since single learners 
sometimes may learn relatively more from additional training data that are 
randomly provided than ensembles do. Furthermore, it is observed that 
the benefit from ensemble of unpruned decision trees is usually bigger 
than that from ensemble of pruned decision trees. This phenomenon is 
then explained from the view of error-ambiguity balance. 



1 Introduction 

Ensemble learning paradigms train a collection of learners to solve a problem. 
Since the generalization ability of an ensemble is usually better than that of a 
single learner, one of the most active areas of research in supervised learning has 
been to study paradigms for constructing good ensembles [5]. 

This paper does not attempt to propose any new ensemble algorithm. Instead, 
it tries to explore how the change of the training set size impacts the benefit from 
ensemble, i.e. the gains obtained by employing ensemble learning paradigms. 
Insight into this may help to better exploit the potential of 
ensemble learning paradigms. This goal is pursued in this paper with an empirical 
study on ensembles of pruned or unpruned J4.8 decision trees [9] generated by 
two popular ensemble algorithms, i.e. Bagging [3] and Boosting (in fact, Boosting 
is a family of ensemble algorithms, but here the term is used to refer to the most 
famous member of this family, i.e. AdaBoost [6]). Experimental results show that 
enlarging the training set does not necessarily enlarge the benefit from ensemble. 
Moreover, interesting issues on the benefit from ensemble, which are related to the 
characteristics of Bagging and Boosting and the effect of decision tree pruning, 
have been disclosed and discussed. 

The rest of this paper is organized as follows. Section 2 describes the empirical 
study. Section 3 analyzes the experimental results. Section 4 summarizes the 
observations and derivations. 



2 The Empirical Study 

Twelve data sets with 2,000 to 7,200 examples, 10 to 36 attributes, and 2 to 10 
classes from the UCI Machine Learning Repository [2] are used in the empirical 
study. Information on the experimental data sets is tabulated in Table 1. 



Table 1. Experimental data sets 

Data set          Size     Attribute                  Class 
                           Categorical  Continuous 
allbp             2,800    22           7             3 
ann               7,200    15           6             3 
block             5,473    0            10            5 
hypothyroid       3,772    22           7             2 
kr-vs-kp          3,196    36           0             2 
led7              2,000    7            0             10 
led24             2,000    24           0             10 
sat               6,435    0            36            6 
segment           2,310    0            19            7 
sick              3,772    22           7             2 
sick-euthyroid    3,156    22           7             2 
waveform          5,000    0            21            3 



Each original data set is partitioned into ten subsets with similar 
distributions. In the first round, only one subset is used; in the second round, 
two subsets are used; and so on. The earlier generated data sets are proper 
subsets of the later ones. In this way, the increase of the size of the data set 
is simulated. 
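
The nested data sets described above can be generated along the following lines. This is a sketch only: the paper does not say how the "similar distributions" were obtained, so the per-class round-robin assignment, the function names, and the toy labels are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_partition(labels: Sequence[Hashable], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split instance indices into k parts with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    parts: List[List[int]] = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the parts.
        for j, idx in enumerate(indices):
            parts[(offset + j) % k].append(idx)
        offset += len(indices)
    return parts


def nested_subsets(parts: List[List[int]]) -> List[List[int]]:
    """Cumulative unions: 10%, 20%, ..., 100% of the original data."""
    result, current = [], []
    for part in parts:
        current = current + part
        result.append(list(current))
    return result


toy_labels = [0] * 60 + [1] * 40          # made-up class labels
subsets = nested_subsets(stratified_partition(toy_labels))
sizes = [len(s) for s in subsets]
```

Because each subset is built by appending one more part, every earlier subset is contained in all later ones, as the text requires.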

On each generated data set, 10-fold cross validation is performed. In each fold, 
Bagging and Boosting are respectively employed to train an ensemble comprising 
20 pruned or unpruned J4.8 decision trees. For comparison, a single J4.8 decision 
tree is also trained from the training set of the ensembles. The whole process is 
repeated ten times, and the average error ratios of the ensembles generated 
by Bagging and Boosting against the single decision trees are recorded, as shown 
in Tables 2 to 5, respectively. The predictive error rates of the single decision 
trees are shown in Tables 6 and 7. In these tables the first row indicates the 
percentage of data in the original data sets that are used, and the numbers 
following '±' are the standard deviations. 
[Figure: plot with legend Bag-prun, Boost-prun, Bag-unprun, Boost-unprun; 
x-axis: relative training set size] 

Fig. 1. The geometrical mean error ratio of Bagged/Boosted J4.8 decision trees 
with/without pruning against single J4.8 decision trees with/without pruning 

Here the error ratio is defined as the result of dividing the predictive error 
rate of an ensemble by that of a single decision tree. A smaller error ratio means 
a relatively bigger benefit from ensemble, while a bigger error ratio means a 
relatively smaller benefit from ensemble. If an ensemble is worse than a single 
decision tree, then its error ratio is bigger than 1.0. 

In order to exhibit the overall tendency, the geometric mean error ratio, i.e. 
the average ratio across all data sets, is also provided in Tables 2 to 5 and is 
explicitly depicted in Fig. 1. 
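
The two quantities defined above can be computed as in the following sketch. The function names and the numbers are made up for illustration; they are not values from the tables.

```python
import math
from typing import Sequence


def error_ratio(ensemble_error: float, single_error: float) -> float:
    """Error rate of the ensemble divided by that of the single tree."""
    return ensemble_error / single_error


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up error rates (%) on two hypothetical data sets.
ratios = [error_ratio(2.0, 4.0), error_ratio(6.0, 3.0)]   # 0.5 and 2.0
overall = geometric_mean(ratios)
```

With the geometric mean, halving the error on one data set and doubling it on another cancel out exactly, which is why it is a common way to aggregate ratios.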

3 Discussion 

3.1 Bagging and Boosting 

An interesting phenomenon exposed by Fig. 1 and Tables 2 to 5 is that the 
benefits from Bagging and Boosting behave quite differently as the training 
set size changes. In detail, although there are some fluctuations, the benefit 
from Bagging remains relatively unvaried while that from Boosting tends to 
be enlarged when the training set size increases. In order to explain this 
phenomenon, it may be helpful to consider the different characteristics of 
Bagging and Boosting from the view of bias-variance reduction. 

Given a learning target and the size of training set, the expected error of a 
learning algorithm can be broken into the sum of three non-negative quantities, 
i.e. the intrinsic noise, the bias, and the variance [7]. The intrinsic noise is a 
lower bound on the expected error of any learning algorithm on the target. The 
bias measures how closely the average estimate of the learning algorithm is able 
to approximate the target. The variance measures how much the estimate of 
the learning algorithm fluctuates for the different training sets of the same size. 
Table 2. Error ratios of Bagged J4.8 decision trees against single J4.8 decision trees. 
All the trees are pruned. 

Data set          10%         20%         30%         40%         50% 
allbp             .909±.193   .914±.111   .952±.070   1.01±.125   .961±.149 
ann               1.15±.203   .899±.135   1.03±.180   .952±.209   .931±.086 
block             .911±.186   .908±.061   .906±.119   .886±.067   .885±.046 
hypothyroid       1.10±.236   1.01±.223   1.14±.243   1.04±.135   1.19±.247 
kr-vs-kp          .938±.104   .897±.180   1.06±.376   .991±.138   .892±.119 
led7              .976±.054   .976±.049   .978±.017   .986±.025   .988±.019 
led24             .902±.063   .920±.042   .941±.048   .961±.027   .958±.035 
sat               .786±.052   .751±.031   .736±.043   .722±.040   .729±.021 
segment           .800±.142   .898±.128   .913±.106   .851±.105   .816±.102 
sick              1.22±.352   .938±.201   .999±.121   .975±.150   1.08±.143 
sick-euthyroid    .826±.257   .952±.212   .955±.159   .945±.139   .911±.080 
waveform          .672±.138   .663±.068   .664±.084   .671±.073   .655±.056 
geometric-mean    .933        .894        .940        .916        .916 

Data set          60%         70%         80%         90%         100% 
allbp             .958±.078   1.02±.085   .940±.080   1.01±.116   1.01±.038 
ann               1.08±.170   .985±.152   1.04±.140   1.09±.161   1.15±.162 
block             .876±.043   .907±.057   .873±.055   .864±.044   .862±.048 
hypothyroid       1.14±.255   .908±.132   1.04±.194   .955±.076   .973±.089 
kr-vs-kp          .810±.169   .957±.136   1.02±.142   1.02±.130   1.07±.082 
led7              .996±.014   1.01±.014   1.00±.012   1.01±.011   1.01±.013 
led24             .960±.036   .967±.023   .969±.028   .972±.028   .979±.021 
sat               .703±.029   .719±.027   .724±.014   .715±.019   .704±.026 
segment           .849±.121   .859±.085   .845±.086   .826±.095   .857±.092 
sick              1.07±.096   1.10±.250   1.14±.365   .978±.104   .954±.081 
sick-euthyroid    .950±.129   .956±.119   .941±.085   .992±.068   1.00±.108 
waveform          .670±.057   .650±.037   .690±.051   .699±.028   .679±.030 
geometric-mean    .922        .920        .935        .928        .937 



Since the intrinsic noise is an inherent property of the given target, usually only 
the bias and variance are of concern. 

Previous research shows that Bagging works mainly through reducing the 
variance [1][4]. It is evident that such a reduction is realized by utilizing bootstrap 
sampling to capture the variance among the possible training sets under the given 
size and then smoothing the variance through combining the trained component 
learners. Suppose the original data set is D, a new data set D' is bootstrap 
sampled from D, and the size of D' is the same as that of D, i.e. |D|. Then, the 
size of the shared part between D and D' can be estimated according to Eq. 1, 
which shows that the average overlap ratio is a constant, roughly 63.2%. 

$$\left(1 - \left(1 - \frac{1}{|D|}\right)^{|D|}\right)|D| \approx (1 - 0.368)\,|D| = 0.632\,|D| \qquad (1)$$ 
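
The constant in Eq. 1 can be checked numerically. The sketch below compares the closed-form expression with a small simulation of bootstrap sampling; the function names, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import random


def expected_overlap_ratio(n: int) -> float:
    """Closed form of Eq. 1 divided by |D|: 1 - (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_overlap_ratio(n: int, trials: int = 200, seed: int = 0) -> float:
    """Average fraction of distinct original examples in a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw n indices with replacement and count the distinct ones.
        distinct = {rng.randrange(n) for _ in range(n)}
        total += len(distinct) / n
    return total / trials


closed_form = expected_overlap_ratio(1000)
simulated = simulated_overlap_ratio(1000)
```

Both values stay close to 0.632 for any reasonably large data set, which is the point the argument relies on: the overlap ratio does not depend on the training set size.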
Table 3. Error ratios of Boosted J4.8 decision trees against single J4.8 decision trees. 
All the trees are pruned. 

Data set          10%         20%         30%         40%         50% 
allbp             1.00±.190   .930±.202   .895±.096   1.02±.163   .883±.157 
ann               1.26±.349   1.01±.238   1.07±.138   .926±.116   .924±.145 
block             .858±.101   .934±.110   .963±.149   .951±.073   1.01±.044 
hypothyroid       1.63±.506   .983±.197   1.07±.236   .924±.394   1.12±.394 
kr-vs-kp          .707±.244   .669±.151   .823±.261   .827±.179   .751±.113 
led7              .998±.008   1.00±.000   1.00±.000   1.00±.000   1.00±.000 
led24             1.06±.104   1.09±.079   1.16±.069   1.15±.054   1.17±.059 
sat               .717±.041   .679±.047   .653±.040   .658±.038   .657±.025 
segment           .783±.138   .702±.226   .654±.084   .605±.139   .559±.141 
sick              1.47±.749   1.05±.251   1.02±.130   .898±.113   .956±.071 
sick-euthyroid    1.09±.329   1.02±.272   .971±.221   .980±.209   .969±.162 
waveform          .675±.165   .644±.075   .604±.057   .628±.061   .626±.056 
geometric-mean    1.02        .893        .907        .881        .885 

Data set          60%         70%         80%         90%         100% 
allbp             .896±.077   .931±.092   .817±.059   .844±.058   .854±.041 
ann               1.04±.135   .981±.089   .894±.073   .977±.118   1.09±.146 
block             .986±.050   .979±.061   .986±.074   .997±.039   .968±.033 
hypothyroid       .984±.314   .797±.214   .886±.255   .720±.184   .771±.130 
kr-vs-kp          .549±.220   .522±.160   .559±.153   .623±.243   .652±.177 
led7              1.00±.000   1.00±.000   1.00±.000   1.00±.000   1.00±.000 
led24             1.16±.033   1.19±.042   1.20±.030   1.18±.028   1.18±.032 
sat               .621±.031   .656±.037   .645±.028   .641±.015   .636±.018 
segment           .589±.149   .519±.086   .497±.079   .491±.110   .523±.092 
sick              .967±.149   .934±.179   .896±.167   .838±.165   .820±.078 
sick-euthyroid    .952±.224   .990±.193   .975±.170   1.02±.217   1.02±.229 
waveform          .624±.047   .629±.043   .655±.042   .650±.032   .669±.029 
geometric-mean    .864        .844        .834        .832        .849 



This means that the variance among the possible samples with the same size 
that could be captured by a given number of trials of bootstrap sampling might 
not significantly change when the training set size changes. Therefore, when the 
training set size increases, the improvement of the ensemble owes much to the 
improvement of the component learners caused by the additional training data 
instead of the capturing of more variance through bootstrap sampling. Since the 
single learner also improves on the additional training data in the same way as 
the component learners in the ensemble do, the benefit from Bagging might not 
be significantly changed when the training set is enlarged. 

As for Boosting, previous research shows that it works through reducing both 
the bias and variance but primarily through reducing the bias [1][4]. It is evident 
that such a reduction in bias is realized mainly by utilizing adaptive sampling, 
Table 4. Error ratios of Bagged J4.8 decision trees against single J4.8 decision trees. 
All the trees are unpruned. 

Data set          10%         20%         30%         40%         50% 
allbp             1.03±.047   .733±.029   .965±.005   .890±.129   .898±.107 
ann               .995±.111   .980±.109   .985±.146   1.00±.178   .879±.153 
block             .856±.205   .902±.058   .862±.120   .847±.059   .830±.049 
hypothyroid       1.08±.149   .867±.283   1.02±.236   .864±.078   1.07±.292 
kr-vs-kp          .865±.151   .998±.325   .891±.200   .899±.141   .887±.252 
led7              .979±.041   .968±.045   .978±.021   .999±.026   .988±.019 
led24             .823±.082   .803±.046   .803±.043   .827±.030   .806±.043 
sat               .750±.061   .735±.036   .716±.051   .693±.039   .702±.017 
segment           .795±.149   .871±.127   .898±.121   .812±.085   .810±.110 
sick              1.08±.337   .927±.188   .866±.091   .883±.148   1.00±.112 
sick-euthyroid    .818±.203   .848±.100   .827±.079   .858±.113   .836±.074 
waveform          .669±.134   .664±.067   .668±.090   .665±.070   .656±.062 
geometric-mean    .895        .858        .873        .853        .864 

Data set          60%         70%         80%         90%         100% 
allbp             .906±.089   .885±.035   .876±.077   .858±.030   .901±.025 
ann               .991±.149   1.01±.095   .953±.111   1.11±.245   1.23±.273 
block             .836±.092   .841±.028   .872±.065   .844±.030   .804±.044 
hypothyroid       1.11±.453   .788±.108   .993±.355   .870±.160   .858±.066 
kr-vs-kp          .864±.158   .869±.144   1.02±.181   .893±.133   1.01±.135 
led7              .997±.015   1.01±.012   .996±.016   1.01±.009   1.01±.013 
led24             .796±.029   .785±.022   .800±.018   .799±.017   .807±.013 
sat               .680±.026   .687±.027   .699±.014   .694±.023   .685±.023 
segment           .792±.138   .822±.105   .808±.071   .806±.078   .806±.096 
sick              .976±.145   1.07±.170   1.03±.242   .940±.121   .909±.064 
sick-euthyroid    .795±.097   .824±.091   .832±.077   .843±.062   .924±.062 
waveform          .663±.051   .647±.028   .682±.053   .693±.025   .669±.028 
geometric-mean    .867        .853        .880        .863        .884 



i.e. adaptively changing the data distribution to enable a component learner 
to focus on the examples that are hard for its predecessor. When the training set 
is enlarged, the adaptive sampling process becomes more effective since more hard 
examples for a component learner could be effectively identified and then passed 
on to the successive learner, some of which might not be identified when the 
training set is relatively small. Therefore, the reduction in bias may be enhanced 
as the size of the training set increases, so the benefit from Boosting tends to be 
enlarged. 
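
The adaptive change of the data distribution can be illustrated with one round of an AdaBoost.M1-style weight update. This is a generic sketch of that well-known update rule, not the exact implementation used in the experiments; the function name and the toy weights are assumptions.

```python
from typing import List, Sequence


def adaboost_m1_reweight(weights: Sequence[float], correct: Sequence[bool]) -> List[float]:
    """One AdaBoost.M1-style update: shrink the weights of correctly
    classified examples by beta = eps / (1 - eps), then renormalize."""
    eps = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5) for an update")
    beta = eps / (1.0 - eps)
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated]


# Four examples with uniform weights; only the last one is misclassified.
new_weights = adaboost_m1_reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After the update the single hard example carries half of the total weight, so the next component learner concentrates on it.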

It is worth noting that Fig. 1 and Tables 2 to 5 also show that the benefit 
from ensemble, even for Boosting, does not always increase consistently when 
the training set size increases. This is not difficult to understand because the 
chances for an ensemble to get improved from the additional training data that 
Table 5. Error ratios of Boosted J4.8 decision trees against single J4.8 decision trees. 
All the trees are unpruned. 

Data set          10%         20%         30%         40%         50% 
allbp             .953±.187   .746±.090   .984±.128   .883±.175   .851±.131 
ann               1.47±.300   1.07±.123   .939±.146   1.06±.366   .832±.168 
block             .860±.210   .898±.122   .911±.075   .890±.032   .943±.048 
hypothyroid       1.44±.453   .949±.345   .951±.350   .753±.316   1.00±.262 
kr-vs-kp          .646±.243   .757±.236   .762±.281   .865±.315   .656±.197 
led7              1.00±.000   1.00±.000   1.00±.000   1.00±.000   1.00±.000 
led24             .968±.107   .974±.061   .956±.052   .980±.022   .965±.036 
sat               .691±.071   .672±.070   .640±.038   .631±.029   .624±.015 
segment           .723±.087   .646±.103   .549±.064   .594±.097   .539±.065 
sick              1.23±.886   1.11±.259   .917±.130   .876±.123   .960±.152 
sick-euthyroid    .985±.242   .948±.171   .860±.162   .906±.097   .877±.095 
waveform          .665±.114   .643±.101   .621±.066   .631±.063   .600±.066 
geometric-mean    .969        .868        .841        .839        .821 

Data set          60%         70%         80%         90%         100% 
allbp             .875±.138   .875±.054   .751±.105   .755±.069   .727±.050 
ann               .989±.221   1.00±.108   .948±.159   1.02±.150   1.17±.103 
block             .928±.048   .938±.075   .975±.048   .930±.011   .923±.051 
hypothyroid       .869±.366   .742±.291   .811±.118   .696±.121   .726±.103 
kr-vs-kp          .673±.282   .585±.075   .719±.175   .619±.066   .613±.143 
led7              1.00±.000   1.00±.000   1.00±.000   1.00±.000   1.00±.000 
led24             .953±.026   .958±.020   .976±.024   .959±.020   .972±.016 
sat               .614±.038   .615±.027   .611±.012   .620±.020   .624±.017 
segment           .541±.064   .506±.051   .501±.047   .497±.081   .511±.057 
sick              .892±.103   .936±.147   .931±.209   .860±.122   .864±.074 
sick-euthyroid    .823±.077   .870±.079   .870±.089   .895±.064   .936±.078 
waveform          .601±.038   .638±.028   .637±.045   .636±.050   .659±.040 
geometric-mean    .813        .805        .811        .791        .810 



are randomly provided might be less than that of a single learner since the 
ensemble is usually far stronger than the single learner. It is analogous to the 
fact that improving a poor learner is easier than improving a strong learner. 
Therefore, the benefit of ensemble shrinks if the improvement of the ensemble is 
not as big as that of the single learner on additional training data. However, if 
the additional training data have been adequately selected so that most of them 
can benefit the ensemble, then both the ensemble and the single learner could 
be significantly improved while the benefit from ensemble will not decrease. 

3.2 Pruned and Unpruned Trees 

Another interesting phenomenon exposed by Fig. 1 and Tables 2 to 5 is that the 
benefit from ensemble comprising unpruned decision trees is always bigger than 
Table 6. Predictive error rate (%) of pruned single C4.5 decision trees. 

Data set          10%          20%          30%          40%          50% 
allbp             4.40±1.36    4.00±0.58    3.57±0.52    3.25±0.55    3.35±0.38 
ann               0.98±0.41    0.73±0.25    0.52±0.16    0.49±0.14    0.45±0.08 
block             4.39±1.00    3.97±0.27    3.59±0.53    3.60±0.32    3.19±0.29 
hypothyroid       1.45±0.57    1.24±0.39    0.97±0.34    0.69±0.14    0.57±0.16 
kr-vs-kp          4.62±1.37    2.75±0.58    1.75±0.57    1.36±0.40    1.14±0.26 
led7              35.21±2.92   31.83±3.07   30.08±1.87   28.81±1.67   28.01±1.48 
led24             36.19±6.13   33.17±3.35   30.67±2.83   30.93±1.49   29.93±1.41 
sat               19.22±1.68   17.02±1.04   15.87±0.59   15.33±0.52   14.63±0.54 
segment           9.16±3.07    6.94±2.00    5.85±1.41    5.59±1.37    4.94±1.21 
sick              2.21±0.97    2.27±0.97    1.86±0.36    1.79±0.27    1.63±0.33 
sick-euthyroid    4.07±1.97    3.39±1.68    3.20±1.56    2.95±1.00    2.67±0.80 
waveform          11.27±1.87   10.43±1.59   10.44±1.00   9.70±0.93    9.96±0.58 

Data set          60%          70%          80%          90%          100% 
allbp             3.10±0.32    2.88±0.28    2.94±0.30    2.87±0.29    2.76±0.17 
ann               0.39±0.07    0.41±0.06    0.39±0.05    0.33±0.03    0.30±0.03 
block             3.20±0.18    3.11±0.21    3.15±0.21    3.08±0.15    3.03±0.11 
hypothyroid       0.53±0.16    0.54±0.12    0.53±0.11    0.49±0.06    0.45±0.04 
kr-vs-kp          1.01±0.20    0.93±0.16    0.80±0.13    0.72±0.13    0.57±0.08 
led7              27.82±1.41   27.39±0.68   27.02±0.56   26.90±0.50   26.73±0.27 
led24             29.02±1.28   28.56±1.03   28.30±0.72   28.20±0.34   27.78±0.54 
sat               14.83±0.59   14.11±0.46   14.11±0.53   13.70±0.49   13.54±0.30 
segment           4.11±1.06    3.82±0.86    3.46±0.82    3.21±0.63    2.93±0.59 
sick              1.50±0.30    1.42±0.28    1.33±0.35    1.39±0.25    1.38±0.29 
sick-euthyroid    2.69±1.08    2.51±0.62    2.42±0.52    2.23±0.49    2.21±0.49 
waveform          9.88±0.48    9.75±0.56    9.38±0.45    9.13±0.46    8.95±0.31 



that comprising pruned decision trees, regardless of whether Bagging or Boosting 
is employed. In order to explain this phenomenon, it may be helpful to consider 
the effect of decision tree pruning from the view of error-ambiguity balance. 

It has been shown that the generalization error of an ensemble can be 
decomposed into two terms, i.e. $E = \bar{E} - \bar{A}$, where $\bar{E}$ is the average 
generalization error of the component learners while $\bar{A}$ is the average 
ambiguity [8]. The smaller the $\bar{E}$ and the bigger the $\bar{A}$, the better the ensemble. 

In general, the purpose of decision tree pruning is to avoid overfitting. With 
the help of pruning, the generalization ability of a decision tree is usually 
improved. Thus, the $\bar{E}$ of an ensemble comprising pruned decision trees may be 
smaller than that of an ensemble comprising unpruned decision trees. But on 
the other hand, pruning usually decreases the ambiguity among 
the decision trees. This is because some trees may become more similar after 
pruning. Thus, the $\bar{A}$ of an ensemble comprising pruned decision trees may be 
smaller than that of an ensemble comprising unpruned decision trees. 
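
The error-ambiguity decomposition can be verified numerically in the setting where it is usually stated, namely squared error with a uniformly weighted average of the component predictions. That setting, the function name, and the toy predictions below are assumptions made for illustration; the decision tree ensembles in this study are classifiers, for which the identity serves as an intuition rather than an exact equality.

```python
from typing import Sequence, Tuple


def error_ambiguity(predictions: Sequence[float], target: float) -> Tuple[float, float, float]:
    """Return (ensemble error, average member error, average ambiguity)
    for a uniformly averaged ensemble under squared error."""
    n = len(predictions)
    ensemble_output = sum(predictions) / n
    ensemble_error = (ensemble_output - target) ** 2
    average_error = sum((f - target) ** 2 for f in predictions) / n
    # Ambiguity: spread of the members around the ensemble output.
    average_ambiguity = sum((f - ensemble_output) ** 2 for f in predictions) / n
    return ensemble_error, average_error, average_ambiguity


# Three made-up member predictions for a target value of 2.0.
ens_err, avg_err, avg_amb = error_ambiguity([1.0, 2.0, 4.0], 2.0)
```

Here the average member error is 5/3 and the average ambiguity is 14/9, so the ensemble error is their difference, 1/9: more disagreement among members lowers the ensemble error for a fixed average member error.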
Table 7. Predictive error rate (%) of unpruned single C4.5 decision trees. 

Data set          10%          20%          30%          40%          50% 
allbp             4.44±1.27    4.62±0.59    3.91±0.66    3.56±0.72    3.65±0.44 
ann               1.00±0.40    0.73±0.26    0.52±0.16    0.51±0.14    0.47±0.10 
block             4.50±1.04    4.00±0.26    3.69±0.54    3.77±0.27    3.34±0.20 
hypothyroid       1.74±0.88    1.40±0.43    1.11±0.37    0.75±0.16    0.61±0.18 
kr-vs-kp          4.41±1.16    2.51±0.56    1.65±0.42    1.34±0.37    1.16±0.19 
led7              35.36±2.85   32.04±3.15   30.06±1.72   28.61±1.63   28.09±1.66 
led24             40.29±7.11   38.42±2.67   36.95±2.76   36.96±1.32   36.58±2.23 
sat               19.92±1.69   17.40±0.99   16.34±0.63   15.93±0.56   15.09±0.52 
segment           9.92±1.89    7.60±1.17    6.43±0.76    6.02±0.43    5.23±0.52 
sick              2.38±1.10    2.14±0.67    2.03±0.43    1.89±0.31    1.59±0.27 
sick-euthyroid    3.84±1.09    3.42±0.97    3.22±0.87    2.96±0.68    2.80±0.20 
waveform          11.27±1.71   10.37±1.58   10.36±1.02   9.77±0.88    9.99±0.61 

Data set          60%          70%          80%          90%          100% 
allbp             3.36±0.35    3.16±0.37    3.17±0.34    3.09±0.42    3.02±0.14 
ann               0.40±0.09    0.40±0.07    0.38±0.05    0.33±0.05    0.27±0.04 
block             3.33±0.24    3.24±0.20    3.26±0.24    3.23±0.13    3.24±0.09 
hypothyroid       0.56±0.16    0.59±0.15    0.55±0.15    0.51±0.09    0.48±0.05 
kr-vs-kp          1.00±0.21    0.91±0.14    0.77±0.12    0.71±0.10    0.60±0.08 
led7              27.81±1.48   27.34±0.69   27.08±0.71   27.06±0.49   26.92±0.26 
led24             35.88±1.52   35.89±1.04   35.19±0.75   35.23±0.76   34.41±0.63 
sat               15.27±0.62   14.66±0.46   14.54±0.53   14.06±0.55   13.84±0.30 
segment           4.55±0.62    4.11±0.31    3.78±0.34    3.44±0.31    3.17±0.17 
sick              1.52±0.23    1.34±0.19    1.26±0.29    1.26±0.20    1.22±0.09 
sick-euthyroid    2.90±0.21    2.78±0.44    2.66±0.15    2.53±0.15    2.39±0.14 
waveform          9.99±0.54    9.80±0.49    9.47±0.44    9.21±0.44    9.02±0.32 



In other words, in constituting an ensemble, the advantage of the stronger 
generalization ability of pruned decision trees may be offset to some degree by 
their disadvantage of smaller ambiguity. Thus, although an ensemble comprising 
pruned decision trees may be stronger than that comprising unpruned decision 
trees, the gap between the former and the pruned single decision trees may not 
be as big as that between the latter and the unpruned single decision trees. 
Therefore, the benefit from ensemble of unpruned decision trees is usually bigger 
than that from ensemble of pruned decision trees. 



4 Conclusion 

In summary, the empirical study described in this paper discloses: 

- Enlarging the training set tends to enlarge the benefit from Boosting but 
  does not significantly impact the benefit from Bagging. This is because the 
  increase of the training set size may enhance the bias reduction effect of 
  adaptive sampling but may not significantly benefit the variance reduction 
  effect of bootstrap sampling. 

- The benefit from ensemble does not always increase along with the increase 
  of the size of training set. This is because single learners sometimes may 
  learn relatively more from randomly provided additional training data than 
  ensembles do. 

- The benefit from ensemble of unpruned decision trees is usually bigger than 
  that from ensemble of pruned decision trees. This is because in constituting 
  an ensemble, the relatively big ambiguity among the unpruned decision trees 
  counteracts their relatively weak generalization ability to some degree. 

These findings suggest that when dealing with huge volumes of data, ensemble 
learning paradigms employing adaptive sampling are more promising, adequately 
selected training data are more helpful, and the generalization ability of the 
component learners could be sacrificed to some extent if this leads to a very 
significant increase of the ambiguity. 
Abstract. Determining the causal relation among attributes in a domain 
is a key task in data mining and knowledge discovery. The Minimum 
Message Length (MML) principle has demonstrated its ability in 
discovering linear causal models from training data. To explore the ways 
to improve efficiency, this paper proposes a novel Markov Blanket 
identification algorithm based on the Lasso estimator. For each variable, this 
algorithm first generates a Lasso tree, which represents a pruned candidate 
set of possible feature sets. The Minimum Message Length principle 
is then employed to evaluate all those candidate feature sets, and the feature 
set with minimum message length is chosen as the Markov Blanket. 
Our experiment results show the ability of this algorithm. In addition, 
this algorithm can be used to prune the search space of causal discovery, 
and further reduce the computational cost of those score-based causal 
discovery algorithms. 



1 Introduction 

Graphical models carrying probabilistic or causal information have a long and 
rich tradition, which began with the geneticist Sewall Wright [1,2]. There ap- 
peared many variants in different domains, such as Structural Equations Mod- 
els [3] and Path Diagrams [2] within social science, and Bayesian Networks in 
artificial intelligence. The year of 1988 marked the publication of Pearl’s influ- 
ential book [4] on graphical models, and since then, many research papers that 
address different aspects and applications of graphical models in various areas 
have been published. 

In social sciences, there is a class of limited Graphical Models, usually referred
to as Linear Causal Models, including Path Diagrams [2] and Structural Equation
Models [3]. In Linear Causal Models, effect variables are strictly linear functions
of exogenous variables. Although this is a significant limitation, its adoption
allows for a comparatively easy environment in which to develop causal discovery
algorithms.

In 1996, Wallace et al. successfully introduced an information-theoretic
approach to the discovery of Linear Causal Models. This algorithm uses Wallace’s
Minimum Message Length (MML) criterion [5] to evaluate and guide the search
for Linear Causal Models, and their experiments indicated that the MML criterion
is capable of recovering a Linear Causal Model which is very close to the
original model [6]. In 1997, Dai et al. further studied the reliability and robustness
issues in causal discovery [7]. They closely examined the relationships among
the complexity of the causal model to be discovered, the strength of the causal
links, the sample size of the given data set and the discovery ability of individual
causal discovery algorithms. In 2002, Dai and Li proposed a new encoding
scheme for model structure, and Stirling’s approximation was further applied to
reduce the computational complexity of the discovery process [8,9]. Different
encoding schemes and search strategies have been compared in [10], and empirical
results revealed that greedy search works very well when compared with
other more sophisticated search strategies. To further enhance the accuracy of
causal discovery, Dai and Li introduced Bagging into causal discovery.
The theoretical analysis and experimental results confirm
that this method can also be used to alleviate the local minimum problem in
discovering causal models [11].

Despite many successful applications, people who intend to use graphical
models still encounter the problem of extensive computational cost. Indeed, it
is an NP-hard problem to find the best model structure in the model space,
except in the special case where each node has no more than one parent. Clearly,
increasing the amount of pruning and reducing the model space is the basis for
solving this problem.

The Markov Blanket is the minimal set of variables conditioned on which 
all other variables are independent of the particular variable. It is an important 
concept in graphical models, and a number of researchers have recently applied 
it to address the issue of feature selection [12,13]. It has been shown that the 
Markov Blanket of a variable consists of strongly relevant variables related to 
this variable. Therefore, from a learned graphical model, it is easy to identify the 
set of relevant variables employing the concept of Markov Blanket. That is, first, 
we need to learn the graphical model, then identify the Markov blanket of the 
target variable from the model structure, finally use those variables in the Markov 
blanket as the selected features. The relationship between Markov Blankets and 
the optimal feature set motivates our novel method for the identification of 
Markov blankets: first, we can use some method to find out the optimal feature 
set; then, the optimal feature set can be used as an approximation to the Markov 
Blanket. 

This paper presents a novel algorithm to identify the Markov blanket for
each variable using the lasso estimator [14]. The rest of this paper is organized
as follows. In Section 2 we briefly review the basic concepts of linear causal
models and lasso estimation. In Section 3 the Minimum Message Length (MML)
principle is employed to choose the optimal feature set. In Section 4
experimental results are given to evaluate the performance of this algorithm.
Finally, we conclude this paper in Section 5.

2 Preliminaries

2.1 Linear Causal Models and Markov Blanket 

A Linear Causal Model is a Directed Graphical Model in which every variable 
concerned is continuous. Informally speaking, it consists of two parts: Struc- 
ture, which qualitatively describes the relation among different variables; and 
Parameters, which quantitatively describe the relation between a variable and 
its parents. 

The model Structure is restricted to ones in which the inter-variable relations
are open to a physically-tenable causal interpretation. Thus, if we define the
“ancestors” of a variable as its parents, its parents’ parents, and so on, we require
that no variable has itself as an ancestor. With this restriction, the Structure of
a Linear Causal Model among those continuous variables may be represented
by a directed acyclic graph (DAG) in which each node represents a variable,
and there is a directed edge from $V_i$ to $V_j$ if and only if $V_i$ is a parent of $V_j$.
The local relation between a variable and its parents is captured by a linear
function. Specifically, if we have a set of continuous variables, a Linear Causal
Model specifies, for each variable $V_i$, a possibly-empty “parent set” of it, and a
probabilistic relation among data values,

$$V_i = \sum_{j=1}^{K_i} \alpha_{ij} \times Pa_{ij} + R_i \qquad (1)$$

where $K_i$ is the number of parents of node $V_i$, $\alpha_{ij}$ ($j = 1, \ldots, K_i$) is the linear
coefficient reflecting the strength of the relationship between $V_i$ and its $j$-th
parent $Pa_{ij}$, and $R_i$ is assumed to be identically distributed following a Gaussian
distribution with zero mean and standard deviation $\sigma_i$, that is, $R_i \sim N(0, \sigma_i^2)$, so
the set of local Parameters for a node $V_i$ with parents is $\{\sigma_i^2, \alpha_{i1}, \ldots, \alpha_{iK_i}\}$. On
the other hand, for a node $V_i$ with an empty set of parents, we assume it is a
random sample from a Gaussian distribution, $V_i \sim N(\mu_i, \sigma_i^2)$, where $\mu_i$ is the
expected value of node $V_i$, so the local Parameters at node $V_i$ are $\{\mu_i, \sigma_i^2\}$.
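As an illustration of this definition, the following minimal sketch draws samples from a linear causal model by visiting the variables in a topological order. The function name, the dictionary layout and the example coefficients are hypothetical and are not taken from any model in this paper.

```python
import random


def sample_linear_causal_model(model, order, n, seed=0):
    """Draw n joint samples from a linear causal model.

    model[v] = (parent_coefficients, mean, sigma), where parent_coefficients
    maps each parent of v to its linear coefficient. For a node with parents
    the mean is 0; for a root node it is the expected value mu.
    `order` must list parents before their children.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for v in order:
            coeffs, mean, sigma = model[v]
            linear_part = sum(a * row[p] for p, a in coeffs.items())
            row[v] = mean + linear_part + rng.gauss(0.0, sigma)
        rows.append(row)
    return rows


# Hypothetical three-variable model: V1 -> V2 -> V3 and V1 -> V3.
example_model = {
    "V1": ({}, 1.0, 1.0),
    "V2": ({"V1": 0.8}, 0.0, 0.5),
    "V3": ({"V1": 0.3, "V2": -0.6}, 0.0, 0.5),
}
samples = sample_linear_causal_model(example_model, ["V1", "V2", "V3"], n=5)
```

Each row is a dictionary of variable values, so a data set of any size can be generated for testing a discovery algorithm against a known structure.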

In a linear causal model, the Markov Blanket $MB(V_i)$ of a particular variable
$V_i$ is the minimal set of variables such that, conditioned on $MB(V_i)$, any other
variable is independent of $V_i$. From the model structure, the Markov Blanket of
a variable $V_i$ can be easily identified: it is the union of the parents and children of $V_i$,
and the parents of the children of $V_i$. An example of a Markov Blanket in a linear
causal model is shown in Fig. 1, in which the Markov Blanket of variable $V_4$ is $\{V_3, V_5, V_6\}$,
and this means that the variables $V_1$ and $V_2$ are independent of $V_4$ conditioned
on $\{V_3, V_5, V_6\}$.
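The structural rule (parents, children, and parents of children) can be sketched in a few lines. The DAG below is a hypothetical one, chosen only so that the Markov Blanket of V4 comes out as {V3, V5, V6}; it is not claimed to be the graph of Fig. 1.

```python
def markov_blanket(parents, node):
    """Return the Markov Blanket of `node` in a DAG.

    `parents` maps every variable to the set of its parents.
    """
    children = {v for v, ps in parents.items() if node in ps}
    # Parents of the children ("spouses"), which may include the node itself.
    spouses = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | spouses) - {node}


# Hypothetical DAG, assumed for illustration only.
example_dag = {
    "V1": set(),
    "V2": {"V1"},
    "V3": {"V1", "V2"},
    "V4": {"V3"},
    "V5": set(),
    "V6": {"V4", "V5"},
}

print(sorted(markov_blanket(example_dag, "V4")))  # ['V3', 'V5', 'V6']
```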

Because of the relationship between Markov Blankets and the optimal feature
set, this paper focuses on the task of identifying Markov Blankets. Before we delve
into the specific algorithm, we introduce the concept of the Lasso Estimator.

Fig. 1. An example of a Markov Blanket within a Linear Causal Model



2.2 The Lasso Estimator 

Consider a regular linear regression problem with $m$ predictor variables
$\{X_1, \ldots, X_m\}$ and one response variable $Y$, where the training data is
$(x_{k1}, \ldots, x_{km}, y_k)$, $k = 1, \ldots, n$. In this situation, the ordinary least squares
(OLS) regression finds the linear combination of all those $m$ predictor variables
that minimizes the residual sum of squares. A closely related optimization
problem is

$$\hat{\alpha}^0 = \arg\min \sum_{k=1}^{n} \Big( y_k - \sum_{j=1}^{m} \alpha_j x_{kj} \Big)^2$$

where $\hat{\alpha}^0 = \{\hat{\alpha}^0_1, \ldots, \hat{\alpha}^0_m\}$ is the OLS regression coefficient vector.

However, when the predictor variables are highly correlated or when $m$ is
large, the variances of the least squares coefficients may be unacceptably high.
Standard solutions to this difficulty include ridge regression and feature selection.
As an alternative to standard ridge regression and subset selection techniques,
Tibshirani proposed the Least Absolute Shrinkage and Selection Operator
(Lasso) [14], which is a constrained version of the ordinary least squares. The
Lasso estimator solves the following optimization problem

$$\hat{\alpha}(t) = \arg\min \Big\{ \sum_{k=1}^{n} \Big( y_k - \sum_{j=1}^{m} \alpha_j x_{kj} \Big)^2 : \sum_{j=1}^{m} |\alpha_j| \le t \Big\}$$

where $t$ is a tuning parameter. If $t$ is greater than or equal to $\sum_{j=1}^{m} |\hat{\alpha}^0_j|$, the lasso
regression coefficients are the same as the OLS coefficients. For smaller values of
$t$, the Lasso estimator shrinks the regression coefficient vector $\hat{\alpha}(t)$ towards the
origin, and usually sets some of the regression coefficients to zero.
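The shrinkage behaviour can be reproduced with a small coordinate-descent sketch. Note the assumption: the code solves the penalized (Lagrangian) form, minimizing half the residual sum of squares plus `lam` times the sum of absolute coefficients, rather than the constrained form with bound $t$; each `lam >= 0` corresponds to some bound $t$, with larger `lam` meaning smaller $t$. The function names and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * sum_k (y_k - sum_j a_j x_kj)^2 + lam * sum_j |a_j|
    by cyclic coordinate descent. X is a list of rows, y a list of targets."""
    n, m = len(X), len(X[0])
    a = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            rho = 0.0   # correlation of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for k in range(n):
                pred_without_j = sum(a[i] * X[k][i] for i in range(m) if i != j)
                rho += X[k][j] * (y[k] - pred_without_j)
                norm += X[k][j] ** 2
            a[j] = soft_threshold(rho, lam) / norm
    return a


# Toy data with orthogonal predictors: y = 2 * x1 + 0.5 * x2 exactly.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.5, 1.5, -1.5, -2.5]

print(lasso_cd(X, y, lam=0.0))  # [2.0, 0.5], the OLS solution
print(lasso_cd(X, y, lam=3.0))  # [1.25, 0.0], the weak predictor is dropped
```

With no penalty the OLS coefficients are recovered; as the penalty grows, the weaker coefficient reaches exactly zero first, which is the behaviour exploited for feature selection.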

Fig. 2(a) shows all Lasso solutions $\hat{\alpha}(t)$ for the variable $V_4$ in the Blau model,
as $t$ increases from 0, where $\hat{\alpha} = 0$, to $t = 0.86558$, where $\hat{\alpha}$ equals the OLS
regression coefficient vector. Fig. 2(b) shows all Lasso solutions for the variable
Vi. We see that the Lasso estimator tends to shrink the regression coefficients
toward 0, more so for small values of $t$.

(a) Lasso Tree for Variable $V_4$

(b) Lasso Tree for Variable Vi

Fig. 2. Lasso Trees for variables in Blau model



Thus the Lasso estimator has a parsimony property: for any given constraint
parameter $t$, only a subset of the predictor variables have non-zero coefficients, and
this property makes the Lasso a useful tool for feature selection. One difficulty
faced by many regular feature selection algorithms is the exponential growth
of the size of the power set of the predictor variables. However, this difficulty
is gracefully avoided by Lasso estimation: as $t$ shrinks,
the regression coefficients normally become 0 one by one. Thus, the number
of candidate feature sets is usually equal to the number of predictor variables,
rather than the size of the power set of the predictor variables. This is evident
from Fig. 2. For example, in Fig. 2(a), the possible feature sets for variable $V_4$
are $\{\}$, $\{V_5\}$, $\{V_5, V_6\}$, $\{V_5, V_6, V_3\}$, $\{V_5, V_6, V_3, V_1\}$, and $\{V_5, V_6, V_3, V_1, V_2\}$.
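Reading candidate feature sets off a sequence of Lasso solutions amounts to collecting the distinct supports (sets of non-zero coefficients) along the path. The sketch below assumes the path is already available as a list of coefficient vectors ordered from the most to the least constrained solution; the function name and the coefficient values are hypothetical.

```python
def candidate_feature_sets(path, names, tol=1e-12):
    """Return the distinct supports along a lasso path, in order of appearance.

    path:  list of coefficient vectors, one per value of the constraint.
    names: variable names, aligned with the coefficient positions.
    """
    candidates = []
    for coefs in path:
        support = frozenset(n for n, a in zip(names, coefs) if abs(a) > tol)
        if support not in candidates:
            candidates.append(support)
    return candidates


# Hypothetical path for three predictors entering one at a time.
names = ["V5", "V6", "V3"]
path = [
    [0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.6, 0.2, 0.0],
    [0.7, 0.3, 0.1],
]
for s in candidate_feature_sets(path, names):
    print(sorted(s))
```

The number of candidates grows with the number of predictors rather than with the size of their power set.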

Now, the problem is how to find an optimal feature set from those candidate 
sets. 

3 Choosing the Optimal Feature Set Using the MML 
Principle 

A fundamental problem in model selection is the evaluation of different mod- 
els for fitting data. In 1987, Wallace and Freeman [15] extended their earlier 
method [5] for model selection based on Minimum Message Length (MML) 
coding. The basic idea of the MML principle is to find the hypothesis H which 
leads to the shortest message M. Its motivation is that this method can auto- 
matically embrace selection of an hypothesis of appropriate complexity as well 
as leading to good estimates of any free parameters of the hypothesis. 

In this section we apply the MML principle to choosing the optimal feature
set. This requires an estimate of the optimal code for describing the model
and the data set, and it can be considered as a specific case of the general
MML application to discovering linear causal models [9]. To ease exposition,
suppose the candidate feature set consists of $m$ predictor variables $\{X_1, \ldots, X_m\}$,
$Y$ is the response variable, and the data set $D$ consists of $n$ instances, i.e.,



Identifying Markov Blankets Using Lasso Estimation 313 



$D = \{(x_{k1}, \ldots, x_{km}, y_k), k = 1, \ldots, n\}$. Then the relation between $Y$ and its $m$
predictor variables can be captured by the following linear function

$$Y = \sum_{j=1}^{m} \alpha_j X_j + R \qquad (2)$$

where $R$ represents the regression residuals, which are assumed to be i.i.d.
and to follow a Gaussian distribution $N(0, \sigma^2)$. All the local parameters
$\theta = \{\alpha_1, \ldots, \alpha_m, \sigma\}$ can be estimated by ordinary least squares (OLS) regression,
which attempts to minimize the residual sum of squares.

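Therefore, for this linear model with local parameters $\theta = \{\alpha_1, \ldots, \alpha_m, \sigma\}$,
the encoding length for this model becomes

$$msgLen(\theta) = -\log \frac{h(\theta)}{\sqrt{|F(\theta)|}} + \frac{m+1}{2} \log \kappa_{m+1} + \frac{m+1}{2} \qquad (3)$$

where the term $h(\theta)$ is the prior probability of parameter $\theta$, the term $\kappa_{m+1}$ is
the $(m+1)$-dimensional optimal quantizing lattice constant [16], and the term
$|F(\theta)|$ is the determinant of the empirical Fisher information matrix [17] (page
96), which can be calculated by

$$F(\theta) = \begin{pmatrix} \dfrac{2n}{\sigma^2} & 0 \\ 0 & \dfrac{X'X}{\sigma^2} \end{pmatrix} \qquad (4)$$

where $\theta = \{\alpha_1, \ldots, \alpha_m, \sigma\}$, $X$ is the $n \times m$ matrix consisting of $n$ observations
of $m$ predictor variables, and we can use an $m \times m$ matrix $A$ to represent $X'X$.
So the determinant of $F(\theta)$ becomes $|F(\theta)| = 2n \, \sigma^{-2(m+1)} |A|$.

In this paper, we use a uniform prior over the finite parameter space, so $h(\theta)$
is a constant. Thus, the encoding length for this linear model is

$$msgLen(\theta) = -\log h(\theta) + \frac{1}{2}\log 2n + \frac{1}{2}\log|A| - (m+1)\log\sigma + \frac{m+1}{2}\log\kappa_{m+1} + \frac{m+1}{2} \qquad (5)$$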

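Given the local parameter $\theta$, the encoding length for the training data $D$ is

$$\begin{aligned}
msgLen(D \mid \theta) &= msgLen(R \mid \theta) \\
&= -\log Prob(R \mid \theta) \\
&= \frac{n}{2}\log 2\pi + \frac{n}{2}\log \sigma^2 + \sum_{k=1}^{n} \frac{\big(y_k - \sum_{j=1}^{m} \alpha_j x_{kj}\big)^2}{2\sigma^2} - n \log \varepsilon \qquad (6)
\end{aligned}$$

where $\varepsilon$ is the measurement accuracy of the regression residual $R$, and it can be
ignored when comparing two models.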
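The data part of the message is the negative log-likelihood of Gaussian residuals measured to a fixed accuracy. A minimal sketch follows, assuming natural logarithms (lengths in nits) and a hypothetical function name; the accuracy term defaults to 1.0 so that it contributes nothing, which is harmless when models are only compared on the same data.

```python
import math


def data_message_length(residuals, sigma, eps=1.0):
    """Length (in nits) of encoding residuals under N(0, sigma^2),
    with each residual measured to accuracy eps."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return (0.5 * n * math.log(2 * math.pi)
            + 0.5 * n * math.log(sigma ** 2)
            + rss / (2 * sigma ** 2)
            - n * math.log(eps))


# Larger residuals under the same sigma cost more to encode.
short = data_message_length([0.1, -0.2, 0.05], sigma=0.5)
long_ = data_message_length([1.0, -2.0, 0.5], sigma=0.5)
```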

Thus, for each candidate feature set, we can estimate its minimum message
length by $msgLen(\theta) + msgLen(D \mid \theta)$, where $\theta$ is the set of local parameters
related to this feature set. Because the number of candidate feature sets returned
by the lasso estimator is usually equal to the number of predictor variables, we can
compare the candidate sets by simply calculating their corresponding encoding
message lengths. As such, the candidate with the minimum message length will
be chosen as the optimal feature set.

As pointed out in Section 1, the close relation between Markov Blanket and 
the optimal feature set leads to a novel Markov Blanket identification algorithm, 
that is, the chosen optimal feature set can be used as an approximation to the 
Markov Blanket. The pseudo code of this algorithm is shown in Alg. 1. 



Algorithm 1 Lasso-based Markov Blanket Identification
Require: A lasso estimator Lasso, training set D, response variable Y.

Ensure: MB = Markov Blanket of Y

CFS = Lasso(D, Y) {Generate candidate feature sets using Lasso estimation}
OFS = full set of predictor variables
for each CFS_i ∈ CFS do
    if msgLen(D, CFS_i) < msgLen(D, OFS) then
        OFS = CFS_i
    end if
end for {Iterate over each candidate feature set to choose the optimal one}

MB = OFS {The Markov Blanket can be approximated by the optimal feature set.}
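The selection loop of Algorithm 1 can be sketched as follows. The message-length function is passed in as a parameter, so the sketch makes no claim about how it is computed; the function name, the candidate sets and the message lengths in the example are hypothetical.

```python
def select_markov_blanket(candidate_sets, full_set, msg_len):
    """Return the feature set with the shortest message length.

    candidate_sets: feature sets produced by the lasso step.
    full_set:       all predictor variables (the initial choice).
    msg_len:        callable mapping a frozenset of features to a length.
    """
    best = frozenset(full_set)
    best_len = msg_len(best)
    for cfs in candidate_sets:
        cfs = frozenset(cfs)
        length = msg_len(cfs)
        if length < best_len:
            best, best_len = cfs, length
    return best


# Hypothetical message lengths, keyed by feature-set size for brevity.
lengths_by_size = {0: 120.0, 1: 101.5, 2: 98.2, 3: 97.0, 5: 99.4}
candidates = [set(), {"V5"}, {"V5", "V6"}, {"V5", "V6", "V3"}]
full = {"V1", "V2", "V3", "V5", "V6"}

mb = select_markov_blanket(candidates, full, lambda s: lengths_by_size[len(s)])
print(sorted(mb))  # ['V3', 'V5', 'V6']
```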



4 Experimental Results

In this section, the performance of the Lasso-based Markov Blanket identification
(LMB) algorithm is evaluated. First, a full set of MB identification results for
the Verbal & Mechanical Ability model is given to show the ability of our LMB
algorithm. Then, different data sets are tested to establish a general performance
measure for this algorithm.



4.1 A Case Study: Verbal and Mechanical Ability Model 

Fig. 3(a) illustrates the Verbal & Mechanical Ability model described by
Loehlin [18]. In this model, variables $V_1$, $V_2$ and $V_3$ are three hypothetical variables
introduced by factor analysis, while $V_4$ and $V_5$ are two verbal tests, and $V_6$
and $V_7$ are two mechanical tests.

For this case, the Lasso-based Markov Blanket identification algorithm is
applied to find the Markov Blanket for each variable. Fig. 3 shows all those
results. For example, Fig. 3(b) gives the lasso tree for variable $V_1$, for which the
MML selection chooses the optimal feature set $\{V_2, V_3\}$; Fig. 3(c) gives the lasso
tree for variable $V_2$, for which the LMB algorithm finds the optimal feature set
$\{V_1, V_4, V_5\}$, as indicated by the vertical line at $t = 0.83$.

From these figures, it is clear that the algorithm closely recovered all the
Markov Blankets of variables in the Verbal & Mechanical Ability model.

(c) MB Result for Variable $V_2$

(e) MB Result for Variable $V_4$

(g) MB Result for Variable $V_6$

Fig. 3. A Case Study for Learning MB using Lasso Estimator

4.2 Performance Evaluation

In order to get a general idea of this algorithm’s performance, eight linear causal
models reported in related literature [6,8] are re-examined: Fiji, Evans, Blau,
Goldberg, V&M Ability, Case9, Case10 and Case12. The details of these models
are described in Table 1.



Table 1. Information of Examined Data Set

Data Set       Number of Variables
Fiji           4
Evans          5
Blau           6
Goldberg       6
V&M Ability    7
Case9          9
Case10         10
Case12         12



The identified Markov Blankets can be compared against the actual Markov
Blankets in terms of recall rate and exclude rate, where the recall rate is the ratio
of correctly identified Markov Blanket variables to the true Markov Blanket
variables, and the exclude rate is the ratio of correctly excluded non-Markov Blanket
variables to the true non-Markov Blanket variables.
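The two measures can be computed for a single target variable as sketched below. The function and argument names are hypothetical, and how the per-variable figures are aggregated into one rate per model is not specified here, so the sketch stops at the per-variable level. A rate is returned as `None` when its denominator is empty, as happens for the exclude rate in a complete DAG.

```python
def recall_and_exclude(identified, true_mb, other_variables):
    """Recall and exclude rates for one target variable.

    identified:      variables selected as the Markov Blanket.
    true_mb:         variables in the actual Markov Blanket.
    other_variables: every variable except the target itself.
    """
    identified = set(identified)
    true_mb = set(true_mb)
    non_mb = set(other_variables) - true_mb
    recall = len(identified & true_mb) / len(true_mb) if true_mb else None
    exclude = len(non_mb - identified) / len(non_mb) if non_mb else None
    return recall, exclude


others = {"V1", "V2", "V3", "V5", "V6"}
recall, exclude = recall_and_exclude({"V3", "V5"}, {"V3", "V5", "V6"}, others)
# One true member (V6) was missed; both non-members were excluded.
```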



Table 2. Result of LMB algorithm over examined Data set

Data Set       Recall Rate   Exclude Rate
Fiji           1             -
Evans          0.88          1
Blau           0.94          0.92
Goldberg       1             1
V&M Ability    1             1
Case9          0.91          0.93
Case10         1             1
Case12         1             1



Table 2 gives the results of the LMB algorithm over the examined data sets. Out
of these eight models, the structure of the Fiji model is a complete DAG, so
every variable belongs to the Markov blanket of every other variable and it is
impossible to calculate the exclude rate. Five out of eight models are correctly
identified by the LMB algorithm. For the Evans model, all the non-Markov blanket
variables are excluded; however, some of the Markov blanket variables are
also excluded by mistake, so that the recall rate is less than 1. For the
Blau and Case9 models, the recall rates and exclude rates are all less than 1, which
indicates that some non-Markov blanket variables were selected while
some Markov blanket variables were excluded.

5 Conclusion and Future Work 

This paper presented a novel algorithm (LMB) to identify Markov Blankets using
the Lasso estimator. The experimental results reported in this paper show that in
general this algorithm can correctly distinguish Markov Blanket variables
from non-Markov Blanket variables.

We take these results to be significant confirmation that the Lasso can be used
as a useful assisting tool for the discovery of linear causal models. Future work
can be carried out on the following aspects:

- First, the identification result can be further refined using some properties of
Markov Blankets, such as symmetry: if a variable X is in the Markov Blanket of variable
Y, then the variable Y is also in the Markov Blanket of variable X.

- Second, the identified Markov Blanket can be used to prune the search space
of model selection. This is a potential method to further increase the efficiency
of linear causal model discovery.

- Third, how to distinguish parent variables from the other variables in the Markov
Blanket can be another interesting topic.

- Finally, the reason that the MML principle performs very well for the identification
of Markov Blankets has not been studied, and its comparative performance
with the cross-validation selection strategy also calls for further
experimental work.
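As a small illustration of the symmetry property mentioned in the first item, the sketch below only reports the pairs that violate it in a set of identified Markov Blankets. How such violations should be resolved (by adding or by removing the variable) is left open, since that is the subject of the proposed future work. The function name and the example blankets are hypothetical.

```python
def asymmetric_pairs(blankets):
    """Return sorted (x, y) pairs where x is in MB(y) but y is not in MB(x).

    blankets maps each variable to its identified Markov Blanket.
    """
    return sorted(
        (x, y)
        for y, mb in blankets.items()
        for x in mb
        if y not in blankets.get(x, set())
    )


# Hypothetical identification result with one inconsistency.
identified = {
    "V1": {"V2"},
    "V2": {"V1", "V3"},
    "V3": set(),
}
print(asymmetric_pairs(identified))  # [('V3', 'V2')]
```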
Abstract. The naive Bayes classifier is widely used in interactive
applications due to its computational efficiency, direct theoretical base,
and competitive accuracy. However, its attribute independence assumption
can result in sub-optimal accuracy. A number of techniques have
explored simple relaxations of the attribute independence assumption
in order to increase accuracy. TAN is a state-of-the-art extension of
naive Bayes that can express limited forms of inter-dependence among
attributes. Rough sets theory provides tools for expressing inexact or
partial dependencies within a dataset. In this paper, we present a variant
of TAN using rough sets theory and compare their tree classifier structures.
The variant can be thought of as a selective restricted tree Bayesian
classifier. It delivers lower error than both pre-existing TAN-based
classifiers, with substantially less computation than is required by the
SuperParent approach.

Keywords: Naive Bayes, Bayesian Network, Machine Learning, 



1 Introduction 

A classification task in data mining is to build a classifier which can assign a 
suitable class label to an unlabelled instance described by a set of attributes. 
Many approaches and techniques have been developed to create a classification 
model. The naive Bayesian classifier is one of the most widely used in interactive 
applications due to its computational efficiency, competitive accuracy, direct the- 
oretical base, and its ability to integrate the prior information with data sample 
information [1,7,3,5,18,15,14]. It is based on Bayes’ theorem and an assumption 
that all attributes are mutually independent within each class. Assume $X$ is a
finite set of instances, and $A = \{A_1, A_2, \ldots, A_n\}$ is a finite set of $n$ attributes.
An instance $x \in X$ is described by a vector $\langle a_1, a_2, \ldots, a_n \rangle$, where $a_i$ is a
value of attribute $A_i$. $C$ is called the class attribute. Prediction accuracy will be
maximized if the predicted class is

$$L(x) = \arg\max_{c} P(c \mid \langle a_1, a_2, \ldots, a_n \rangle). \qquad (1)$$

Unfortunately, unless the vector occurs many times within $X$, it will not be possible
to directly estimate $P(c \mid \langle a_1, a_2, \ldots, a_n \rangle)$ from the frequency with which
each class $c \in C$ co-occurs with $\langle a_1, a_2, \ldots, a_n \rangle$ within the training instances.
Bayes’ theorem provides an equality that might be used to help estimate the
posterior probability $P(c_i \mid x)$ in such a circumstance:

$$P(c_i \mid x) = \alpha \cdot P(c_i) \cdot P(\langle a_1, a_2, \ldots, a_n \rangle \mid c_i) \qquad (2)$$

where $P(c_i)$ is the prior probability of class $c_i \in C$, $P(\langle a_1, a_2, \ldots, a_n \rangle \mid c_i)$ is
the conditional probability of $x \in T$ given the class $c_i$, and $\alpha$ is a normalization
factor. According to the chain rule, equation 2 can be written as:

$$P(c_i \mid x) = \alpha \cdot P(c_i) \cdot \prod_{k=1}^{n} P(a_k \mid a_1, a_2, \ldots, a_{k-1}, c_i) \qquad (3)$$

Therefore, an approach to Bayesian estimation is to seek to estimate each
$P(a_k \mid a_1, a_2, \ldots, a_{k-1}, c_i)$.

If the $n$ attributes are mutually independent within each class value, then
the conditional probability can be calculated in the following way:

$$P(\langle a_1, a_2, \ldots, a_n \rangle \mid c_i) = \prod_{k=1}^{n} P(a_k \mid c_i). \qquad (4)$$

Classification selecting the most probable class as estimated using formulae 2
and 4 is the well-known naive Bayesian classifier.
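A minimal sketch of this classifier for nominal attributes is given below. Two assumptions are made for illustration: probabilities are estimated with Laplace smoothing, and the normalization factor is dropped because it does not affect which class scores highest. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(instances, labels):
    """Collect the counts needed for P(c) and P(a_k | c)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)            # attribute index -> observed values
    for x, c in zip(instances, labels):
        for k, a in enumerate(x):
            value_counts[(k, c)][a] += 1
            values[k].add(a)
    return class_counts, value_counts, values


def classify(model, x):
    """Return the class maximizing P(c) * prod_k P(a_k | c)."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for k, a in enumerate(x):
            # Laplace-smoothed estimate of P(a_k | c).
            score *= (value_counts[(k, c)][a] + 1) / (n_c + len(values[k]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy training set with two nominal attributes.
train_x = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
train_y = ["no", "no", "yes", "yes"]
model = train_naive_bayes(train_x, train_y)
print(classify(model, ("sunny", "hot")))   # no
print(classify(model, ("rainy", "cool")))  # yes
```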

2 Approaches of Improving Naive Bayesian Method 

In real world problems, the performance of a naive Bayesian classifier is dominated
by two explicit assumptions: the attribute independence assumption and
the probability estimation assumption. Of numerous proposals to improve the
accuracy of naive Bayesian classifiers by weakening the attribute independence
assumption, both Tree Augmented Naive Bayes (TAN) [4,3,5] and Lazy Bayesian
Rules (LBR) [18] have demonstrated remarkable error performance [14]. Friedman,
Geiger and Goldszmidt presented a compromise representation, called tree-augmented
naive Bayes (TAN, simply called the basic TAN), in which the class
node directly points to all attribute nodes and an attribute node can have
at most one additional parent besides the class node. Keogh & Pazzani took a
different approach to constructing tree-augmented Bayesian networks [5] (simply
called SuperParent or SP). The two methods mainly differ in two aspects. One
is the criterion of attribute selection used to select dependence relations among
the attributes while building a tree-augmented Bayesian network. The other is
the structure of the classifiers. The first always tends to construct a tree
including all attributes, while the second always tends to construct a tree with
fewer dependence relations among attributes and better classification accuracy.
Zheng and Webb proposed a lazy Bayesian rule (LBR) learning technique [18].
LBR can be thought of as applying lazy learning techniques to naive Bayesian
rule induction. Both LBR and TAN can be viewed as variants of naive Bayes
that relax the attribute independence assumption [14].

In this paper, however, we concentrate on the eager strategy, which holds a
computational advantage when a single model can be applied to classify many
instances. First of all, we mainly analyze the implementations of two different
TAN classifiers and their tree classifier structures, and experimentally show
how different dependence relations impact on the accuracy of TAN classifiers.
Second, based on the definition of dependence in the basic rough set theory, we
propose a definition of dependence measurement given the class variable, and
use it to build a new dependence relation matrix. We believe that the directions
of dependence relations are very important for the performance of a classifier. Using
this kind of definition, we can actually gain a directed-graph description. Third,
we present a new algorithm for building selective augmented Bayesian network
classifiers, which reduces error relative to the TAN classifiers and has similar
computational overheads. Experimental results also show that it can deliver some
improvements in performance, while requiring substantially less computation.



3 Some Issues in the Implementations 



Now, we discuss some extended issues in the implementations of the TAN classifiers.
First of all, the problem is related to the probability estimation assumption.
In the basic TAN, for each attribute we assess the conditional probability given
the class variable and another attribute. This means that the number of instances
used to estimate the conditional probability is reduced, as it is estimated
from the instances that share three specific values (the class value, the parent
value and the child value). Thus it is not surprising to encounter unreliable estimates,
especially in small datasets. Friedman, Geiger and Goldszmidt dealt with
this problem by introducing a smoothing operation [3]. There is a problem with
this strategy: when attribute value $a$ does not occur in the training data (this
situation can occur during cross-validation testing), the value of the estimate
will be zero. In our implementation, we use both these smoothing adjustments
to estimate any conditional probability with $|\pi(a)| = 2$, i.e.,

$$P(a \mid \pi(a)) = \frac{counts(a, \pi(a)) + N^0 \cdot \frac{1}{|A|}}{counts(\pi(a)) + N^0} \qquad (5)$$

where $|A|$ is the number of values for attribute $A$. We use Laplace estimation
to estimate any other probability. In Keogh and Pazzani’s SuperParent algorithm,
they replace zero probabilities with a small epsilon (0.0001). Kohavi,
Becker and Sommerfield [8] have shown that different methods for estimating
the base probabilities in naive Bayes can greatly impact upon its performance.
Similarly, different estimation methods will have different effects on the same
TAN classification model. We think that estimation methods should depend
on the distribution of variables, as well as the number of training instances to
support these kinds of estimation methods. Estimation methods should be
independent of classification models for the same probability. In order to compare
the performances of classification models, we use the same method to estimate
probabilities. In all the algorithms mentioned in our experiments, we always use
formula 5 to estimate any conditional probability with $|\pi(a)| = 2$, and only
standard Laplace estimation is applied to any other probability.
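The two kinds of estimate can be contrasted in a short sketch. It assumes that the smoothed estimate adds a prior mass spread uniformly over the values of the attribute to the observed counts; the default prior weight of 5.0 is an assumed value chosen for illustration and is not stated in the text, and the function and parameter names are hypothetical.

```python
def smoothed_conditional(count_a_and_parents, count_parents, n_values, n0=5.0):
    """Smoothed estimate of P(a | parents) with a uniform prior of weight n0.

    n0 = 5.0 is an assumed default, not a value given in the text.
    """
    return (count_a_and_parents + n0 / n_values) / (count_parents + n0)


def laplace(count_a_and_parents, count_parents, n_values):
    """Standard Laplace estimate of P(a | parents)."""
    return (count_a_and_parents + 1) / (count_parents + n_values)


# An unseen value no longer receives probability zero.
print(smoothed_conditional(0, 10, n_values=2))
print(laplace(0, 10, n_values=2))
```

With no observations at all, both estimates fall back to the uniform value 1/|A|, and both approach the raw frequency as the counts grow.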

Secondly, regarding the problem of missing values, in both the basic TAN
classifiers and SuperParent classifiers, instances with missing values were
deleted from the set of training instances. We keep all the instances, but
exclude missing attribute values from the counts. Also, when we
estimate a conditional probability $P(a_k \mid c_i)$, for the prior probability of class value
$c_i$ we exclude the occurrences of class value $c_i$ with missing values on attribute
$A_k$. Obviously, this makes the estimation of the condition more reliable while
estimating any conditional probability.

Thirdly, although the choice of root variable does not change the log-likelihood
of the basic TAN network, we have to set the direction of all edges for
classification. When each edge $(A_i, A_j)$ is added to the current tree structure, we
always set the direction from $A_i$ to $A_j$ ($i < j$) at once. In Keogh and Pazzani’s
SuperParent classifiers, the direction of an arc is always from the super parent
node to the favorite child node. That means that when a dependence relation
is singled out, it always has a specific direction.

4 Selective Augmented Bayesian Classifiers 

There are further factors that influence the performance of an augmented naive 
Bayesian classifier. The first one is the criterion for selecting dependence relations 
among the attributes. The second one is the criterion of terminating the selection. 
In this section, we describe our new algorithm for selective augmented Bayesian 
classifiers, explain how it works and is different from the basic TAN classifiers 
and SuperParent classifiers, and experimentally show its better performance 
and preferable computational profile. 

4.1 Dependence Relation Matrix Based on Rough Set Theory 

Friedman, Geiger and Goldszmidt explain why they use the conditional mutual
information as the criterion for selecting dependence relations among the
attributes [3,4]. One problem with this criterion, as mentioned above, is how
to decide the directions of dependence relations. Keogh and Pazzani use
leave-one-out cross-validation to handle this problem in the process of building a
classifier [5,6]. When the best super parent and its favorite child are found, the
dependence relation between them is naturally from the best super parent to
its favorite child, i.e., the favorite child depends on the corresponding best
super parent. For each arc to be added into the structure, SuperParent needs
to execute leave-one-out cross-validation many times on the whole set of training
instances. Although they proposed some shortcuts to speed up the process
of evaluating many classifiers, the algorithm is still very time consuming. In our
algorithm, we will use a dependence measurement based on rough sets theory.

Rough set theory [11] provides, in particular, tools for expressing inexact or partial
dependencies within a dataset. Given a dataset of descriptions or measurements
concerning available instances, rough set methods make it possible to extract
dependence relations between corresponding attributes or variables. These dependence
relations can be applied to inductive reasoning about new, so far unseen cases,
in a way well understandable for the user. These advantages, as well as a very
effective computational framework for extraction of the most interesting dependencies
from real-life data, have led to a rapid development of applications of rough
sets in more and more scientific fields and practical tasks [13].

According to the values of the class variable, we define a new dependence relation
matrix, in which each item of conditional dependence relation, $D(A_i, A_j \mid C)$,
can be described as follows:

$$D(A_i, A_j \mid C) = \sum_{c \in C} \frac{|POS^{c}_{A_i}(A_j)|}{|U|} \qquad (6)$$

where $POS^{c}_{A_i}(A_j)$ represents the positive region of attribute $A_j$ relative
to attribute $A_i$ within the class value $c$ [11], and $U$ is the set of training
instances. Using this kind of definition, we actually obtain a directed-graph
description. Each item not only reflects the degree of the dependence between two
attributes, but also tells us the direction of the dependence relation.
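
As a rough illustration of how such a directed dependence measure can be computed, the following Python sketch uses the standard rough-set definitions: the positive region is the union of the $A_i$-equivalence classes that fall wholly inside one $A_j$-equivalence class, and the dependence degree is its share of all instances. The function names, the toy rows, and the way the class value enters `conditional_dependency` (positive regions computed separately inside each class value, then summed and divided by the number of instances) are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def partition(rows, attr):
    """Group row indices into equivalence classes by the value of one attribute."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[row[attr]].add(idx)
    return list(blocks.values())


def positive_region(rows, a_i, a_j):
    """Rows whose a_i-equivalence class lies wholly inside one a_j-equivalence class."""
    targets = partition(rows, a_j)
    pos = set()
    for block in partition(rows, a_i):
        if any(block <= target for target in targets):
            pos |= block
    return pos


def dependency(rows, a_i, a_j):
    """Standard rough-set dependence degree of a_j on a_i."""
    return len(positive_region(rows, a_i, a_j)) / len(rows)


def conditional_dependency(rows, a_i, a_j, cls):
    """Assumed class-conditional variant: sum positive-region sizes per class value."""
    total = 0
    for value in {row[cls] for row in rows}:
        subset = [row for row in rows if row[cls] == value]
        total += len(positive_region(subset, a_i, a_j))
    return total / len(rows)


# Hypothetical toy data: two attributes "a", "b" and a class "c".
rows = [
    {"a": 0, "b": 0, "c": "x"},
    {"a": 0, "b": 0, "c": "x"},
    {"a": 1, "b": 0, "c": "y"},
    {"a": 1, "b": 1, "c": "y"},
]
forward = dependency(rows, "a", "b")
backward = dependency(rows, "b", "a")
```

On this toy data the measure is asymmetric (`forward` and `backward` differ), which is the property that lets the matrix suggest a direction for each candidate arc.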



4.2 A New Selective Augmented Bayesian Algorithm 

In an augmented Bayesian classifier, the second important issue is how to decide
the candidate dependence set and when to terminate the selection. There are
n(n - 1) different directed dependence relations among n attributes. Whether
n - 1 edges or no edges at all have conditional mutual information greater than 0,
a basic TAN structure will have n - 1 arcs. We also try to add n - 1 arcs to our
augmented Bayesian classifier, but the candidate arc set is different from that of
the basic TAN, because both the way of weighting a candidate arc and the way of
setting the direction of an arc are different.

Based on the above discussion, we can describe our selective augmented
Bayesian algorithm, simply called Select in all tables, as follows.

1. Compute the dependence relation matrix, which takes the place of conditional
mutual information, using formula 6.

2. Select an arc using a near maximum branching directed arborescence
algorithm [10], based on the dependence relation matrix.

3. Use leave-one-out cross-validation to evaluate the current set of
arborescences to decide whether this arc should be added to the current structure,
adding it only if doing so reduces cross-validation error.

4. Repeat steps 2 and 3 n - 1 times, or until no more arcs can be tested.

An important difference from the basic TAN algorithm is that this algorithm
tries to build a maximum directed branching or arborescence [10], not a maximum
undirected spanning tree. We believe that the direction of dependence relations
is critical to minimizing error. In the procedure of building an augmented
Bayesian classifier, the network is always a directed structure.
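
The four steps above can be sketched in Python as follows. This is a simplified greedy stand-in, not the near maximum branching algorithm of [10]: at each round it picks the heaviest remaining arc whose child has no parent yet and which does not close a cycle. The names `select_arcs` and `loo_error` are hypothetical; `loo_error` is assumed to be a caller-supplied function that returns the leave-one-out error of the classifier built with a given arc list.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Arc = Tuple[int, int]  # (parent, child)


def would_create_cycle(parent_of: Dict[int, int], parent: int, child: int) -> bool:
    """Walk up from `parent`; a cycle appears if we reach `child`."""
    node = parent
    while node in parent_of:
        node = parent_of[node]
        if node == child:
            return True
    return False


def select_arcs(dep: Sequence[Sequence[float]],
                loo_error: Callable[[List[Arc]], float]) -> List[Arc]:
    """Greedily add at most n - 1 arcs, keeping an arc only if it lowers the error."""
    n = len(dep)
    arcs: List[Arc] = []
    parent_of: Dict[int, int] = {}
    rejected = set()
    best_error = loo_error(arcs)
    for _ in range(n - 1):
        candidates = [
            (dep[i][j], i, j)
            for i in range(n) for j in range(n)
            if i != j and j not in parent_of and (i, j) not in rejected
            and not would_create_cycle(parent_of, i, j)
        ]
        if not candidates:
            break  # no more arcs can be tested
        _, i, j = max(candidates)  # heaviest remaining candidate arc
        error = loo_error(arcs + [(i, j)])
        if error < best_error:
            arcs.append((i, j))
            parent_of[j] = i
            best_error = error
        else:
            rejected.add((i, j))
    return arcs


# Hypothetical dependence matrix and a toy error function for illustration.
dep = [[0.0, 0.9, 0.2],
       [0.1, 0.0, 0.8],
       [0.3, 0.4, 0.0]]
chosen = select_arcs(dep, lambda arcs: 0.5 - 0.1 * ((0, 1) in arcs))
```

In the toy run only the arc from attribute 0 to attribute 1 lowers the error, so the second candidate is tested and discarded; the result is a forest rather than a full tree.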



5 Experimental Results 

There are thirty-two natural domains used in our experiments, shown in Table 1.
Twenty-nine of these are drawn from previous research [18]. The other three
(Satellite, Segment, and Shuttle) are larger datasets. "S#" means the number of
instances. "C#" means the number of values of the class attribute. "A#" means the
number of attributes, not including the class attribute. All the experiments were
performed in the Weka system [17]. The error rate of each classification model on
each domain is determined by running 10-fold cross-validation on a dual-processor
1.7GHz Pentium 4 Linux computer with 2Gb RAM. We use the default discretization
method "weka.filters.DiscretizeFilter" as the discretization method for continuous
values, which is based on Fayyad and Irani's method [2].

Table 1 also shows the error rates of the naive Bayes classifier (NB), the basic
TAN classifier, the SuperParent classifier (SP), and our selective augmented
Bayesian classifier (Select) on each domain, respectively. The last row contains
the mean error rates for each column. The best one for a given dataset is shown
in bold text. It shows that the selective augmented Bayesian classifier has the
lowest mean error rate. Table 2 presents the WIN/LOSS/DRAW records for comparing
Select with all other classifiers. This is a record of the number of data sets for
which the nominated algorithm achieves lower, higher, and equal error to the
comparison algorithm, measured to two decimal places. The table also includes the
outcome of a two-tailed binomial sign test. This indicates the probability that the
observed outcome or a more extreme one would occur by chance if wins and losses
were equiprobable. The selective augmented Bayesian classifier demonstrates the
best performance. Of the thirty-two databases, the selective augmented Bayesian
classifier has higher classification accuracy than Naive Bayes on nineteen, than
the basic TAN classifier on twenty-one, and than the SuperParent classifier on
sixteen. It is notable that the basic TAN classifier has higher classification
accuracy than Naive Bayes on eighteen databases and lower accuracy on twelve.
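
The two-tailed binomial sign test described above can be sketched in a few lines of Python. The function name is hypothetical, and leaving draws out of the count is an assumption (the usual convention for a sign test). For 19 wins and 10 losses it gives about 0.136, in line with the P column of Table 2.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-tailed binomial sign test with equiprobable wins and losses.

    Draws are ignored (assumption); the p-value is twice the probability of
    seeing at most min(wins, losses) successes in wins + losses fair trials.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_naive_bayes = sign_test_p(19, 10)
p_super_parent = sign_test_p(16, 11)
p_tan = sign_test_p(21, 11)
```

None of these values falls below the conventional 0.05 level, so the win/loss records on their own are suggestive rather than statistically significant.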

This suggests that the selective augmented Bayesian classifier has similar error
to the SuperParent classifier, but is much more efficient. Table 3 shows the time
taken to build each classifier. On most domains, the selective augmented
Bayesian classifier is much faster than the SuperParent classifier.

Table 1. Descriptions of Data and Average Error Rates

    Domain               S#  C#  A#  M     NB     SP    TAN  Select
 1  Annealing           898   6  38  Y   5.46    3.3   4.34    4.12
 2  Audiology           226  24  69  Y  29.20  27.88  27.43   26.55
 3  Breast Cancer       699   2   9  Y   2.58   2.58   5.01    2.58
 4  Chess(kr-vs-kp)    3169   2  39  N  12.36   5.19   6.54    8.01
 5  Credit               69   2  15  Y  15.07  15.22  14.93   15.23
 6  Echocardiogram      131   2   6  Y  27.48  29.01  35.88   28.24
 7  Glass               214   7   9  N  41.12  41.59  37.85   35.98
 8  Heart               270   2  13  N  15.19  16.30  20.74   17.04
 9  Hepatitis           155   2  19  Y  16.13  16.13  11.61   14.84
10  Horse Colic         368   2  21  Y  20.11  19.29  19.84   19.02
11  House Votes 84      435   2  16  N   9.89   6.67   7.59    9.20
12  Hypothyroid        3163   2  25  Y   2.94   2.81   2.66    2.75
13  Iris                150   3   4  N   6.00   6.67   5.33    8.00
14  Labors               57   2  16  Y   3.51   3.51  12.28    3.51
15  LED                1000  10   7  N  26.20  26.60  25.90   26.70
16  Bupa                345   2   6  N  36.81  38.84  37.68   39.13
17  Lung Cancer          32   3  56  Y  46.88  50.00  46.88   43.75
18  Lymphography        148   4  18  N  14.19  14.86  18.92   12.84
19  PID                 768   2   8  N  25.00  25.52  25.00   24.74
20  Post Operative       90   3   8  Y  28.89  31.11  35.56   30.00
21  Primary Tumor       339  22  17  Y  48.97  51.15  54.28   50.15
22  Promoters           106   2  57  N   8.49   8.48  16.04   11.32
23  Satellite          6435   6  36  N  18.90  12.18  12.29   12.15
24  Segment            2310   7  19  N  11.08   7.01   6.15    5.54
25  Shuttle           58000   7   9  N  10.07   5.09   8.06    6.95
26  Solarflare         1389   3  10  N   3.89   1.08   1.08    1.87
27  Sonar               208   2  60  N  25.48  25.96  29.33   23.08
28  Soybean             683  19  35  Y   7.17   6.59  10.98    6.44
29  Splice             3177   3  60  N   4.66   4.50   4.60    4.41
30  TTT                 958   2   9  N  29.54  27.45  26.10   29.54
31  Wine                178   3  13  N   3.37   3.37   3.93    2.81
32  Zoology             101   7  16  N   5.94   6.93   4.95    6.93
    Error Mean                          17.58  16.93  18.11   16.67

Table 4 shows the number of arcs in the Bayesian network built by the
SuperParent classifier (SP), the basic TAN classifier, and our classifier (Select)
on each domain, respectively. These experimental results show that the basic TAN
always tends to construct a tree including all attributes, whereas the SP always
tends to
construct a tree with fewer dependence relations among attributes and better
classification accuracy. In order to show that the candidate arc set and the
structure are different from those of the basic TAN classifier, we give an example
produced by the basic TAN algorithm, the SuperParent algorithm, and our selective
augmented Bayesian algorithm on dataset Soybean. The basic TAN algorithm
produced a tree with n - 1 arcs (n = 35), shown in figure 1, where any node
whose parent is node 0 and which has no child is omitted for simplicity. This tree
includes all the attribute nodes. The selective augmented Bayesian algorithm
produced eight arcs in four branches, shown in figure 2. The structure of this
Bayesian classifier is a forest, where some of the arcs belong to the TAN structure
but others do not. This example also shows that our selective augmented Bayesian
algorithm produces different results from the sTAN method described by Keogh and
Pazzani [6]. In the sTAN model, they only select from those edges that appear in
the basic TAN tree structure. The result of the SuperParent algorithm is shown in
figure 3. The SuperParent algorithm only uses leave-one-out cross-validation to
determine the directions of arcs, but it cannot always obtain better performance.
For example, on dataset PID, the Bayesian networks built by both the SuperParent
algorithm and the selective augmented Bayesian algorithm have only one arc, but
the directions are different. The tree structure built by the SuperParent
algorithm is only the arc (4, 1), but the selective augmented Bayesian algorithm
produces one arc with the reverse direction.

Table 2. Comparison of Select to others

              WIN  LOSS  DRAW  P
NaiveBayes     19    10     3  0.1360
Super Parent   16    11     5  0.4420
TAN            21    11     0  0.1102

Table 3. Time in CPU Seconds

    Domain                 SP   TAN  Select
 1  Annealing          110.31  0.17   13.95
 2  Audiology          530.96  1.79  126.38
 3  Breast Cancer        1.20  0.17    0.40
 4  Chess              662.98  1.14   61.81
 5  Credit               5.02  0.14    1.35
 6  Echocardiogram       0.07  0.03    0.10
 7  Glass                0.37  0.11    0.31
 8  Heart                0.31  0.11    0.25
 9  Hepatitis            2.13  0.11    1.59
10  Horse Colic          6.78  0.13    1.42
11  House Votes          1.51  0.10    1.50
12  Hypothyroid        130.58  0.25    28.8
13  Iris                 0.07  0.02    0.07
14  Labor                0.09  0.09    0.17
15  LED                  1.48  0.07    1.54
16  Bupa                 0.24  0.06    0.13
17  Lung Cancer          7.23  0.14    5.56
18  Lymphography         0.36  0.10    1.52
19  PID                  1.37  0.12    1.32
20  Post Operative       0.10  0.02    0.11
21  Ptn                  2.63  0.11    5.22
22  Promoters           11.75  0.34   10.77
23  Satellite         6247.56   4.9  269.66
24  Segment            202.69  1.84   24.36
25  Shuttle            137.68  5.14   79.82
26  Solarflare           8.85  0.07    2.56
27  Sonar              165.06  1.88   22.62
28  Soybean             51.59  0.43   51.63
29  Splice            1090.46  3.71  313.44
30  TTT                  1.07  0.06    0.53
31  Wine                 0.17  0.13    0.48
32  Zoo                  1.80  0.09    0.61

Table 4. Arcs for each algorithm

    Domain            SP  TAN  Select
 1  Annealing          9   27      12
 2  Audiology          6   61      11
 3  Breast Cancer      0    8       1
 4  Chess              8   35      17
 5  Credit             4   13       7
 6  Echocardiogram     0    5      17
 7  Glass              3    5       7
 8  Heart              0   10       3
 9  Hepatitis          2   18       3
10  Horse Colic        4   18       7
11  House Votes        1   15       3
12  Hypothyroid        7   24      14
13  Iris               0    3       1
14  Labor              0   10       3
15  LED                1    6       4
16  Bupa               1    2       1
17  Lung Cancer        2   19       6
18  Lymphography       1   17       4
19  PID                1    5       1
20  Post Operative     2    7       3
21  Ptn                0   16       5
22  Promoters          0   56       4
23  Satellite         35   35      35
24  Segment           13   15      17
25  Shuttle            2    6       8
26  Solarflare         6    9       8
27  Sonar              3   56       7
28  Soybean            2   34       8
29  Splice             2   59       7
30  TTT                1    8       0
31  Wine               0   12       2
32  Zoo                2   15       3

Fig. 1. The Bayesian network of TAN on dataset Soybean

Fig. 2. The Bayesian network of selective augmented TAN on dataset Soybean

Fig. 3. The Bayesian network of super parent TAN on dataset Soybean



6 Conclusion 

In this paper we have investigated three issues that affect the quality of TAN
classifiers learned from data. The first issue is how to estimate the base
probabilities. We conclude that it is important to use a consistent estimation
method when comparing classifiers with each other. The second is how to measure
dependence relations between two attributes with directions. Based on the
definition of dependence degree in basic rough set theory, we propose a definition
of dependence measurement given the class variable for building a classification
model. Thirdly, we present a selective augmented Bayesian network classifier that
reduces error relative to the original TAN with similar computational overheads,
and with much lower computational overheads than the SuperParent, which is a
state-of-the-art variant of the basic TAN classifier. Experimental results show
that it can deliver some improvements in performance, while requiring
substantially less computation than the SuperParent.



Abstract. To gain access to account privileges, an intruder masquerades
as the proper account user. This paper proposes a new strategy for
detecting masquerades in a multiuser system. To detect masquerading 
sessions, one profile of command usage is built from the sessions of 
the proper user, and a second profile is built from the sessions of the 
remaining known users. The sequence of the commands in the sessions 
is reduced to a histogram of commands, and the naive-Bayes classifier 
is used to decide the identity of new incoming sessions. The standard 
naive-Bayes classifier is extended to take advantage of information from 
new unidentified sessions. On the basis of the current profiles, a newly 
presented session is first assigned a probability of being a masquerading 
session, and then the profiles are updated to reflect the new session. As 
prescribed by the expectation-maximization algorithm, this procedure is 
iterated until both the probabilities and the profiles are self-consistent. 
Experiments on a standard artificial dataset demonstrate that this 
self-consistent naive-Bayes classifier beats the previous best-performing 
detector and reduces the missing-alarm rate by 40%. 

Keywords: self-consistent naive-Bayes, expectation-maximization algorithm,
semisupervised learning, masquerade detection, anomaly detection, intrusion
detection.



1 Introduction 

This paper presents a simple technique for identifying masquerading sessions 
in a multiuser system. Profiles of proper and intruding behavior are built on 
the sessions of known users. As new, unidentified sessions are presented, the 
profiles are adjusted to reflect the credibility of these new sessions. Based on the 
expectation-maximization algorithm, the proposed self-consistent naive-Bayes 
classifier markedly improves on earlier detection rates. 

1.1 Motivation 

Unauthorized users behave differently from proper users. To detect intruders,
behaviors of proper users are recorded and deviant behaviors are flagged.
Typically, each user account is assigned to a single proper user with a specified
role. Hence, future sessions under that user account can be compared against the
recorded profile of the proper user. This approach to detecting masquerades is
called anomaly detection.

More generally, intrusion detection is amenable to two complementary ap- 
proaches, misuse detection and anomaly detection. Misuse detection trains on 
examples of illicit behavior; anomaly detection trains on examples of proper be- 
havior. Because examples of intrusions are generally rare and novel attacks can- 
not be anticipated, anomaly detection is typically more flexible. In the context 
of masquerade detection, anomaly detection is especially appropriate because 
each user account is created with an assigned proper user, whose behavior must 
not deviate far from the designated role. 



1.2 Previous Approaches 

The artificial dataset used in this study has been studied by many researchers 
[3,4,7,9,11,12]. Subsection 3.1 presents a summary of the dataset, and full details 
are available in [11]. Before this paper, the best result was achieved by using a 
naive-Bayes classifier with updating [9]. Yung [14] markedly improved detection 
rates but relied on confirmed updates. This paper also substantially improves 
over [9] and is competitive with [14]. 

There are at least two limitations to the updating scheme used in [9] and
[11]. Principally, the detector must make a concrete decision whether to update
the proper profile with the new session. In many cases, the correct judgment of a
session is not clear. Yet the detector is forced to make a binary decision before
it can continue, because future decisions depend on the decision for the current
case.

Furthermore, all future decisions depend on the past decisions, but the con- 
verse does not hold. In principle, the detector can defer the decision on the 
current session until additional new sessions from a later stage are studied. Yet 
in both [9] and [11], the detector is forced to make an immediate, greedy deci- 
sion for each new session, without considering the information from future cases. 
Scores for previous sessions cannot be revised as new sessions are encountered. 
Hindsight cannot improve the detector’s performance on earlier sessions, because 
backtracking is not allowed. 



1.3 New Strategy 

Without a concrete optimality criterion, there is no basis on which to judge 
the performance of the greedy sequential strategy. In general, however, the best 
greedy strategy is not always the best strategy overall. By ignoring strategies 
that involve backtracking, the greedy search may fail to make decisions most 
consistent with the entire dataset. 

This paper presents an alternative strategy for identifying masquerading
sessions. To start, an optimality criterion is defined by specifying a concrete
objective function. Moreover, each new session is assigned a score indicating how
likely that session is to be a masquerading session. Instead of a binary decision,
the detector assigns to the new session a probability between 0 and 1. This new
session is then used to update the detector in a continuous fashion. The technique
presented in this paper is an extension of the naive-Bayes classifier used in [9]
and will be called the self-consistent naive-Bayes classifier.

Although the self-consistent naive-Bayes classifier is a significant step be- 
yond the simple naive-Bayes classifier used by Maxion and Townsend [9], three 
elements introduced there still apply here: the classification formulation, the 
bag-of-words model, and the naive-Bayes classifier. These three strategies are 
not unique to detecting masquerading sessions, but rather they appear in the 
far more general context of intrusion detection and text classification. Yung [14] 
provides a brief review of these three elements, within the context of identifying 
masquerading sessions. 

2 Theory 

The simple naive-Bayes classifier is typically built on a training set of labeled 
documents. Nigam and his colleagues [10] demonstrated that classification accu- 
racy can be improved significantly by incorporating labeled as well as unlabeled 
documents in the training set. The algorithm for training this new self-consistent 
naive-Bayes classifier can be derived from the formalism of the EM-algorithm. 
Only the simplest version of this extended naive-Bayes classifier is presented 
here. 



2.1 Likelihood from Test Sessions 

Only two distinct classes of sessions are considered, the proper class and the
masquerading class. The indicator variable $1_s = 1$ exactly when session $s$ is a
masquerading session. Let $1 - \epsilon$ and $\epsilon$ be the prior probabilities
that a session is proper and masquerading, respectively. Moreover, let $p_c$ and
$p'_c$ be the probability of command $c$ in a proper and masquerading session,
respectively.

The log-likelihood $L_s$ of a test session $s$ is simply, up to an additive
constant,

$$L_s = (1 - 1_s)\Bigl(\log(1 - \epsilon) + \sum_{c=1}^{C} n_{sc} \log p_c\Bigr) + 1_s\Bigl(\log \epsilon + \sum_{c=1}^{C} n_{sc} \log p'_c\Bigr), \qquad (1)$$

where $n_{sc}$ is the total count of command $c$ in session $s$.

Assuming that all test sessions are generated independently of each other, the
cumulative log-likelihood $L^t_+$ after $t$ test sessions is, up to an additive
constant,

$$L^t_+ = \sum_{s=1}^{t} L_s = w^t_+ \log(1 - \epsilon) + \sum_{c=1}^{C} n^t_{+c} \log p_c + {w'}^t_+ \log \epsilon + \sum_{c=1}^{C} {n'}^t_{+c} \log p'_c, \qquad (2)$$

where

$$w^t_+ = \sum_{s=1}^{t} (1 - 1_s), \quad {w'}^t_+ = \sum_{s=1}^{t} 1_s, \quad n^t_{+c} = \sum_{s=1}^{t} (1 - 1_s) n_{sc}, \quad {n'}^t_{+c} = \sum_{s=1}^{t} 1_s n_{sc}. \qquad (3)$$

Here $w^t_+$ and ${w'}^t_+ = t - w^t_+$ are the cumulative numbers of proper and
masquerading sessions, respectively; $n^t_{+c}$ and ${n'}^t_{+c}$ are the
cumulative counts of command $c$ amongst proper sessions and masquerading
sessions, respectively, in the $t$ total observed test sessions.

2.2 Likelihood from Training Sessions 

Now let $n^0_{+c}$ denote the total count of command $c$ among proper sessions in
the training set. Likewise, let ${n'}^0_{+c}$ denote the total count of command
$c$ among masquerading sessions in the training set. Letting $r$ denote a session
in the training set,

$$n^0_{+c} = \sum_{r} (1 - 1_r) n_{rc}, \quad {n'}^0_{+c} = \sum_{r} 1_r n_{rc}. \qquad (4)$$

Assuming that the sessions in the training set are generated independently,
the log-likelihood $L^0_+$ of the proper and masquerading sessions in the training
set is

$$L^0_+ = \sum_{c=1}^{C} n^0_{+c} \log p_c + \sum_{c=1}^{C} {n'}^0_{+c} \log p'_c. \qquad (5)$$

This log-likelihood is useful for providing initial estimates of $p_c$ and $p'_c$
but provides no information about $\epsilon$.

2.3 Posterior Likelihood 

Rare classes and rare commands may not be properly reflected in the training
set. To avoid zero estimates, smoothing is typically applied to the
maximum-likelihood estimators. This smoothing can also be motivated by shrinkage
estimation under Dirichlet priors on the parameters $\epsilon$, $p_c$, and $p'_c$.

Here the parameters $\epsilon$, $p_c$, and $p'_c$ are drawn from known prior
distributions with fixed parameters, and a simple standard Bayesian analysis is
applied. Suppose that

$$(1 - \epsilon, \epsilon) \sim \mathrm{Beta}(\beta, \beta'), \qquad (6)$$

$$p \sim \mathrm{Dirichlet}(\alpha), \qquad (7)$$

$$p' \sim \mathrm{Dirichlet}(\alpha'), \qquad (8)$$

where $\alpha$, $\alpha'$, $\beta$, and $\beta'$ are taken to be known fixed
constants specified in advance.

Then the cumulative posterior log-likelihood $L^t$ is, up to an additive constant,

$$L^t = (\beta - 1 + w^t_+) \log(1 - \epsilon) + \sum_{c=1}^{C} (\alpha_c - 1 + n^0_{+c} + n^t_{+c}) \log p_c + (\beta' - 1 + {w'}^t_+) \log \epsilon + \sum_{c=1}^{C} (\alpha'_c - 1 + {n'}^0_{+c} + {n'}^t_{+c}) \log p'_c. \qquad (9)$$

Here $w^t_+$, ${w'}^t_+$, $n^t_{+c}$, and ${n'}^t_{+c}$, defined in Equation 3, are
cumulative quantities determined by the $t$ available test sessions; $n^0_{+c}$
and ${n'}^0_{+c}$, defined in Equation 4, are the fixed quantities determined by
the training sessions.



Using Self-Consistent Naive-Bayes to Detect Masquerades 333 



2.4 Shrinkage Estimators 



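Equation 9 gives the cumulative posterior log-likelihood $L^t$ after observing $t$
test sessions, in addition to the training sessions. Shrinkage estimators are just
the maximum-likelihood estimators calculated from the posterior log-likelihood
$L^t$. So the cumulative shrinkage estimators $\epsilon^t$, $p^t_c$, and
${p'}^t_c$ for $\epsilon$, $p_c$, and $p'_c$ after $t > 0$ sessions are

$$\epsilon^t = \frac{\beta' - 1 + {w'}^t_+}{\beta - 1 + \beta' - 1 + t}, \qquad (10)$$

$$p^t_c = \frac{\alpha_c - 1 + n^0_{+c} + n^t_{+c}}{\sum_{v=1}^{C} (\alpha_v - 1 + n^0_{+v} + n^t_{+v})}, \qquad (11)$$

$${p'}^t_c = \frac{\alpha'_c - 1 + {n'}^0_{+c} + {n'}^t_{+c}}{\sum_{v=1}^{C} (\alpha'_v - 1 + {n'}^0_{+v} + {n'}^t_{+v})}. \qquad (12)$$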

2.5 Complete Log-Likelihood 

Initially, the training set is used to build the naive-Bayes classifier. As a new 
unidentified session is presented to the classifier, that new session is scored by the 
classifier and used to update the classifier. This scoring and updating procedure 
is repeated until convergence. As each new session is presented, the classifier is 
updated in a self-consistent manner. 

For a session $s$ in the training set, the identity is known. Specifically,
$1_s = 0$ for a proper session, and $1_s = 1$ for a masquerading session in the
training set. For a new unidentified session, $1_s$ is a missing variable that
must be estimated from the available data.



2.6 EM-Algorithm for Naive-Bayes Model



The EM-algorithm [1] is an iterative procedure used to calculate the
maximum-likelihood estimator from the complete log-likelihood. Each iteration of
the EM-algorithm includes two steps, the expectation step and the maximization
step.

In the expectation step, the indicators $1_s$ are replaced by their expectations,
calculated from the current estimates of model parameters. In the maximization
step, the new estimates of the model parameters are calculated by maximizing the
expected log-likelihood. For each stage $t$, these two steps are repeated
through multiple iterations until convergence. The iterations of the EM-algorithm
during a fixed stage $t$ are described below.

Let $\theta^t = (\epsilon^t, p^t_c, {p'}^t_c)$ denote the estimate of
$\theta = (\epsilon, p_c, p'_c)$ at stage $t$, after observing the first $t$ test
sessions. For each known session $r$ in the training set, $1_r$ is a known
constant. So $n^0_{+c}$ and ${n'}^0_{+c}$ of Equation 4 remain fixed even as
$\theta^t$ changes.

For each unidentified test session $s = 1, 2, \ldots, t$, the expectation step
estimates $E[1_s \mid c_s; \theta^t] = P(1_s = 1 \mid c_s; \theta^t)$ via Bayes
rule as

$$P(1_s = 1 \mid c_s; \theta^t) = \frac{P(1_s = 1; \theta^t)\, P(c_s \mid 1_s = 1; \theta^t)}{P(c_s; \theta^t)}, \qquad (13)$$

where

$$P(c_s; \theta^t) = P(1_s = 0; \theta^t)\, P(c_s \mid 1_s = 0; \theta^t) + P(1_s = 1; \theta^t)\, P(c_s \mid 1_s = 1; \theta^t). \qquad (14)$$

For the expectation step of the current iteration, the estimate $\hat{\theta}^t$
from the maximization step of the previous iteration is used for $\theta^t$.

The maximization step calculates an updated estimate $\hat{\theta}^t$ of the
parameter $\theta^t$. The usual cumulative shrinkage estimators of Equations 10-12
are used. However, the counts $w^t_+$, ${w'}^t_+$, $n^t_{+c}$, and ${n'}^t_{+c}$
are now no longer known constants but rather are random variables. These random
variables are replaced by their estimates from the expectation step above. In
other words, the maximization step of the current iteration uses the estimates
from the expectation step of the current iteration.
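
The expectation and maximization steps for one user at a fixed stage can be sketched in Python as below. This is an illustrative simplification, not the author's code: the masquerade prior `eps` is held fixed rather than re-estimated, a single Dirichlet parameter `alpha` is shared by all commands, and the function names, tolerance, and toy counts are hypothetical. Sessions are given as command-count vectors.

```python
import math
from typing import List, Sequence, Tuple


def m_step(train: Sequence[float], expected: Sequence[float], alpha: float) -> List[float]:
    """Shrinkage estimate from training counts plus expected test counts."""
    raw = [alpha - 1 + n0 + nt for n0, nt in zip(train, expected)]
    total = sum(raw)
    return [r / total for r in raw]


def e_step(sessions, p, p_m, eps) -> List[float]:
    """Posterior probability that each session is a masquerade (Bayes rule)."""
    post = []
    for s in sessions:
        lp = math.log(1 - eps) + sum(n * math.log(q) for n, q in zip(s, p))
        lm = math.log(eps) + sum(n * math.log(q) for n, q in zip(s, p_m))
        top = max(lp, lm)  # subtract the maximum to avoid underflow
        post.append(math.exp(lm - top) / (math.exp(lp - top) + math.exp(lm - top)))
    return post


def self_consistent_nb(train_proper, train_masq, sessions, alpha=1.01, eps=0.5,
                       tol=1e-12, max_iter=500) -> Tuple[List[float], List[float], List[float]]:
    """Iterate E and M steps until the command probabilities stop changing."""
    n_cmd = len(train_proper)
    zeros = [0.0] * n_cmd
    p = m_step(train_proper, zeros, alpha)   # initial proper profile
    p_m = m_step(train_masq, zeros, alpha)   # initial masquerade profile
    for _ in range(max_iter):
        post = e_step(sessions, p, p_m, eps)
        exp_proper = [sum((1 - q) * s[c] for q, s in zip(post, sessions)) for c in range(n_cmd)]
        exp_masq = [sum(q * s[c] for q, s in zip(post, sessions)) for c in range(n_cmd)]
        new_p = m_step(train_proper, exp_proper, alpha)
        new_pm = m_step(train_masq, exp_masq, alpha)
        change = sum(abs(a - b) for a, b in zip(new_p + new_pm, p + p_m)) / (2 * n_cmd)
        p, p_m = new_p, new_pm
        if change < tol:
            break
    return p, p_m, e_step(sessions, p, p_m, eps)


# Toy data: two commands; the proper user only ever issues command 0.
p, p_m, post = self_consistent_nb([50, 0], [0, 50], [[10, 0], [0, 10]])
```

Because every test session contributes fractionally to both profiles, a session's posterior can still move after later sessions arrive, which is the behavior the greedy binary update cannot provide.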



3 Results 

The technique proposed in this paper is tested on a standard dataset previ- 
ously analyzed by other authors using different techniques. The dataset contains 
command logs from 50 users on a UNIX multiuser system. To create artificial 
intrusion sessions, sessions from another 20 users are injected at random into the 
command logs of the 50 known users. The dataset contains a training portion, 
without masquerading sessions. The remaining test portion contains possible 
masquerading sessions from the artificially created intruders. A summary of the 
dataset is provided below, but full details are provided in [11]. 

3.1 Reference Dataset 

For privacy reasons, the command logs were stripped of the command options 
and arguments, leaving only the commands themselves. For each of 70 users, 
the first 15000 recorded commands were kept. Then 50 users were chosen at 
random to serve as the dataset, and the remaining 20 users were excluded, 
except to provide artificial intrusion data. For each user, the sequence of 15000 
commands was divided into 150 sequential blocks, each containing 100 commands 
in sequence. Each artificially created block is treated as a session. Therefore, each 
of the 50 users had 150 sessions of 100 commands. 
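
The block construction described above can be sketched in Python as follows. The helper name `to_sessions` and the synthetic command stream are hypothetical; each block is reduced to a histogram of commands, matching the bag-of-words representation used by the classifier.

```python
from collections import Counter
from typing import List, Sequence


def to_sessions(commands: Sequence[str], block_size: int = 100) -> List[Counter]:
    """Cut a command stream into consecutive full blocks and count commands per block."""
    last_start = len(commands) - block_size
    return [Counter(commands[i:i + block_size])
            for i in range(0, last_start + 1, block_size)]


# Synthetic stream of 15000 commands standing in for one user's log.
stream = ["ls", "cd"] * 7500
sessions = to_sessions(stream)
training, test = sessions[:50], sessions[50:]
```

Splitting after the first 50 blocks reproduces the 50 training and 100 test sessions per user described in the text.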

The first 50 sessions of each of the 50 users are assumed to be free of intruders.
These 50 sessions for each of the 50 users constitute the training set. The
remaining 100 sessions for each of the 50 users form the test set. Sessions from
the 20 excluded users are injected into the test sessions of the 50 known users.
The extra sessions in the test sets are then removed, leaving just 100 test
sessions for each of the 50 known users.

To prevent over-training on the data creation process, the exact details of
how the test set is created cannot be used in the intrusion detection procedure.
Essentially, once an intruder has broken into a user account, he is likely to return
on the very next session. A proper session is followed by an intruder session with
0.01 probability. A masquerading session is followed by a masquerading session
from the same intruder with 0.80 probability. A chain of masquerading sessions
is taken from the intruder in a contiguous sequence. Out of the total 5000 test
sessions, there were 231 total masquerading sessions, and the remaining 4769
were proper sessions. These aspects of the data creation are not used as part of
the classification procedure.



3.2 Experimental Design 

The details for one single experiment are described below. The example shows 
how masquerading sessions on User 12 were detected. For other users, the same 
analysis applies. 

The full training set was divided into two classes. The proper class consisted 
of 50 sessions for User 12. The masquerade class consisted of the 50 sessions for 
each of the other 49 users. So the masquerade class was built on a total of 2450 
sessions. 

The test set for User 12 had 100 sessions, all purportedly proper sessions.
The classifier for User 12 was then run against each of the 100 sessions in a
sequential fashion. As a new session is added, the scores of the previous sessions
are also updated. At the end of the 100 sessions, the final set of all 100 scores
was recorded.

The ROC curve for user 12 was generated by thresholding against the final 
set of scores, after all test sessions were presented sequentially. In this case, a 
session’s score is the probability that the session was generated from the proper 
profile. Therefore, sessions with lower scores were flagged before sessions with 
higher scores. 
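
The thresholding step can be sketched in Python as follows. The function name and the toy scores are hypothetical, and tied scores are flagged one at a time for simplicity (an assumption). Each point pairs a false-alarm rate with the corresponding missing-alarm rate as more sessions are flagged.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               is_masquerade: Sequence[bool]) -> List[Tuple[float, float]]:
    """Sweep a threshold over scores (probability of proper), lowest flagged first.

    Returns (false-alarm rate, missing-alarm rate) after each flagged session.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_masq = sum(is_masquerade)
    n_proper = len(scores) - n_masq
    hits = false_alarms = 0
    points = [(0.0, 1.0)]  # nothing flagged: no false alarms, every masquerade missed
    for i in order:
        if is_masquerade[i]:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_proper, 1 - hits / n_masq))
    return points


# Toy example with four sessions, two of them masquerades.
points = roc_points([0.1, 0.9, 0.4, 0.8], [True, False, False, True])
```

Pooling the scores of all users into one call gives the kind of composite curve obtained by global thresholding.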



3.3 Classifier Settings 

A self-consistent naive-Bayes classifier was built for each individual user. The 
algorithm is outlined in Subsection 2.6. The initial naive-Bayes was built on the 
training set. As each new session is presented, that new session is first scored 
by the previous classifier. Afterwards, all the unidentified sessions are used to 
construct the self-consistent naive-Bayes. This procedure is repeated for each 
new session. 

In the full self-consistent naive-Bayes classifier presented in Section 2, the
fraction $\epsilon$ of masquerading sessions was adjusted in each iteration. In the
results presented here, the fraction $\epsilon = 0.50$ was held fixed and was not
estimated as part of the EM-algorithm. Keeping the prior weight
$(1 - \epsilon, \epsilon)$ fixed at (0.50, 0.50) allowed for faster convergence and
did not adversely affect the final results.

For each new session, the EM-algorithm was used to adjust the fractions $p_c$
and $p'_c$. The Dirichlet parameters $\alpha_c = 1.01$ and $\alpha'_c = 1.01$ for
all $c \in C$ were chosen to match the parameters used in [9]. The EM-algorithm was
iterated until the change between iteration $i$ and $i + 1$, measured by the
difference between the successive parameter estimates averaged over the total
number of estimated parameters, was less than some tolerance, set to be
$2.2 \times 10^{-16}$. In practice, the algorithm was not sensitive to the precise
tolerance.

3.4 Composite ROC Curves

A separate classifier was created for each user and used to score that user’s 
test sessions. The scores of test sessions from one user can be compared because 
those scores were assigned by that user’s classifier. However, scores from different 
users cannot easily be compared because those scores are assigned from different 
classifiers. 

To evaluate a strategy’s success over the entire dataset of 50 users, a com- 
posite ROC curve can be useful. There are several common ways to integrate 
the individual 50 ROC curves into one single curve, but all methods are at best 
arbitrary. 

In this paper, the scores from the 50 different classifiers were taken at face 
value. The 5000 total test sessions from the 50 user were sorted, and sessions 
with the lowest scores were flagged first. Because the scores were probability 
values, this global thresholding strategy has some merit. This method for con- 
structing the composite ROC curve was also used in [9] and [11]. Because the 
same methodology was used in these previous papers, the composite ROC curves 
can be meaningfully compared. 
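
The global thresholding strategy can be illustrated with a minimal sketch. It assumes that the scores of all users are pooled into one list and that a low score marks a session as suspicious; the function name and the input format are hypothetical, and ties between scores are not treated specially.

```python
def composite_roc(scores, is_masquerade):
    """Build composite ROC points from pooled per-session scores.

    scores: probability-like scores from all users' classifiers.
    is_masquerade: matching list of booleans (True = masquerading session).
    Returns a list of (false_alarm_rate, missing_alarm_rate) points,
    obtained by flagging sessions one at a time, lowest score first.
    """
    n_pos = sum(1 for flag in is_masquerade if flag)
    n_neg = len(is_masquerade) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    hits = 0
    false_alarms = 0
    points = [(0.0, 1.0)]  # nothing flagged yet: every masquerade is missed
    for i in order:
        if is_masquerade[i]:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_neg, 1 - hits / n_pos))
    return points
```

With scores `[0.1, 0.9, 0.2, 0.8]` and labels `[True, False, False, True]`, the curve starts at (0.0, 1.0), drops to a missing-alarm rate of 0.5 before any false alarm is raised, and ends at (1.0, 0.0).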

3.5 Experimental Results 

Two ROC curves are used to compare the self-consistent naive-Bayes classifier to 
the one-step discrete adaptive naive-Bayes classifier of [9] . These curves together 
demonstrate that the self-consistent naive-Bayes classifier offers significant im- 
provements. All results reported here are based on offline evaluation, after all the 
sessions have been presented sequentially. Online evaluation and deferral policies 
are discussed in a separate paper. 

Figure 1 shows in fine detail the composite ROC curves of all 50 users, for 
false-alarm rate 0.00-0.10. Indeed, the self-consistent naive-Bayes outperforms 
the adaptive naive-Bayes uniformly for all but the small portion of false-alarm 
rate 0.00-0.01. In particular, for false-alarm rate 0.013, the self-consistent naive- 
Bayes reduces the missing-alarm rate of the adaptive naive-Bayes by 40%. 

Figure 2 shows the ROC curve for user 12 alone. As noted by the other 
authors [11,9], user 12 is a challenging case because the masquerading sessions 
in the test set appear similar to the proper sessions. On user 12, the adaptive 
naive-Bayes performs worse than random guessing, but self-consistent naive- 
Bayes encounters little difficulty. In fact, the 50 ROC curves over each individual 
user show that self-consistent naive-Bayes typically outperforms adaptive naive- 
Bayes. 

4 Discussion 

The self-consistent naive-Bayes classifier outperforms the adaptive naive-Bayes 
classifier in all practical circumstances, where only low false-alarm rates can be 
tolerated. For false-alarm rate 0.013, the self-consistent naive-Bayes classifier 
lowers the missing-alarm rate by 40%. 



Fig. 1. ROC curve over users 01-50, less than 0.10 false-alarm rate. 

Fig. 2. ROC curve for user 12. 

4.1 Long-Term Scenario 

Unlike the adaptive naive-Bayes, the self-consistent naive-Bayes is not forced 
to make a binary decision as each new session is presented. This advantage 
will become more dramatic as the number of users and the number of sessions 
increase. As the classifier learns from more cases, mistakes from earlier decisions 
are propagated onwards and magnified. Because the adaptive naive-Bayes is 
forced to make a binary decision immediately, it is also more likely to make 
errors. In a long-term scenario, the improvements of the self-consistent naive- 
Bayes classifier are expected to become far more pronounced. 



4.2 Computation Time 

Although the self-consistent naive-Bayes classifier offers a great number of ad- 
vantages, more computation is required. The iterative procedure used to calcu- 
late the probabilities and to update the profile must be run until convergence. 
This additional effort allows the self-consistent naive-Bayes classifier to assign 
scores in a non-greedy fashion. Naturally, searching through the space of models 
requires more effort than the simple greedy approach. 

In practice, convergence is often quick because most sessions have probability 
scores at the extremes, near 0 or 1. In the context of sequential presentation, only 
one single session is presented at a time. Computing the score of a single session 
requires little additional effort. Moreover, the set of scores from one stage can be 
used as the starting point to calculate scores for the next stage. This incremental 
updating approach relieves the potential computational burden. 



4.3 Nonstationary User Behavior 

For some users, the sessions in the test set differ significantly from the sessions 
in the training set. Because user behavior changes unpredictably with time, 
a detector based on the old behavior uncovered in the training set can raise 
false alarms. This high rate of false alarms can render any detector impractical, 
because there are not enough resources to investigate these cases. 

In certain real-world applications, however, identifying changed behavior may 
well be useful because a proper user can misuse his own account privileges for 
deviant and illicit purposes [8]. If the detector is asked to detect all large deviations 
in behavior, then flagging unusual sessions from the proper user is acceptable. 
This redefinition is especially useful to prevent a user from abusing his own 
privileges. 



4.4 Future Directions 

Only the simplest model of user sessions has been presented. In [10], the 
EM-algorithm was applied to more elaborate and realistic models. These extensions 
offer much potential, and future work will explore these possibilities. 

The iterative nature of the EM-algorithm is quite intuitive. In fact, many 
instantiations [2,5,6,13] of the EM-algorithm existed well before the general EM- 
algorithm was formulated in [1]. Moreover, the self-consistency paradigm can be 
applied even to classifiers not based on a likelihood model. A self-consistent 
version can be constructed for potentially any classifier, in the same way that a 
classifier can be updated by including the tested instance as part of the modified 
training set. 

The naive-Bayes classifier relies only on command frequencies. As a user’s 
behavior changes over time, the profile built from past sessions becomes outdated. 
An exponential-weighting process can be applied to counts from past sessions. For 
the naive-Bayes classifier, this reweighting of counts is especially simple. Such an 
extension becomes even more valuable in realistic scenarios, where a sequential 
classifier is used in an online context for a long time. 
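
One possible form of such exponential weighting is sketched below. The decay factor, the function name, and the dictionary representation of a session are assumptions made for illustration; the text does not specify a particular weighting scheme.

```python
def decayed_profile(session_counts, decay=0.9):
    """Accumulate command counts, down-weighting older sessions.

    session_counts: list of {command: count} dicts, oldest session first.
    decay: hypothetical factor in (0, 1]; each new session multiplies
           all previously accumulated counts by this factor.
    """
    profile = {}
    for counts in session_counts:
        # Age the existing profile before adding the new session.
        for command in profile:
            profile[command] *= decay
        for command, k in counts.items():
            profile[command] = profile.get(command, 0.0) + k
    return profile
```

With `decay=0.5`, the sessions `[{"ls": 10}, {"cd": 4}, {"ls": 2}]` yield the weighted counts `{"ls": 4.5, "cd": 2.0}`, so the most recent sessions dominate the profile.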

5 Summary 

In previous approaches to detecting masquerading sessions, the detector was 
forced to make a binary decision about the identity of a new session. The self- 
consistent naive-Bayes does not make a binary decision but rather estimates the 
probability of a session being a masquerading session. Moreover, past decisions 
can be adjusted to accommodate newer sessions. Experiments show that this 
sensible extension markedly improves over more restrictive adaptive approaches, 
reducing the missing-alarm rate by 40% at a false-alarm rate of 0.013. 

The self-consistent naive-Bayes classifier extends the usual naive-Bayes 
classifier by taking advantage of information from unlabeled instances. New instances 
are assigned probabilistic labels, and then the profiles are updated in accordance 
with the assigned probabilities of the new instances. This procedure is iterated 
until convergence to a final set of probabilities which are consistent with the 
updated profile. 

By its very nature, the self-consistent naive-Bayes classifier is adaptive to 
the new sessions. Moreover, information from new sessions is also used to adjust 
the scores of earlier sessions. In this way, the scores of sessions are assigned 
in a self-consistent manner. As a specific instance of the EM-algorithm, the 
self-consistent naive-Bayes classifier finds a model optimal with respect to its 
likelihood given the available data. 

Abstract. In contrast to mining over transactional data, graph mining is done 
over structured data represented in the form of a graph. Data having structural 
relationships lends itself to graph mining. Subdue is one of the early main 
memory graph mining algorithms that detects the best substructure that 
compresses a graph using the minimum description length principle. The database 
approach to graph mining presented in this paper overcomes the problems of 
performance and scalability inherent to main memory algorithms. The focus 
of this paper is the development of graph mining algorithms (specifically 
Subdue) using SQL and stored procedures in a relational database 
environment. We have not only shown how the Subdue class of algorithms can 
be translated to SQL-based algorithms, but also demonstrated that scalability 
can be achieved without sacrificing performance. 



1 Introduction 

Database mining has been a topic of research for quite some time [1-5]. Most of the 
work in database mining has concentrated on discovering association rules from 
transactional data represented as (binary) relations. The ability to mine over graphs is 
important, as graphs are capable of representing complex relationships. Graph mining 
uses the natural structure of the application domain and mines directly over that 
structure (unlike others where the problem has to be mapped to transactions or other 
representations). Graphs can be used to represent structural relationships in many 
domains (web topology, protein structures, chemical compounds, relationship of 
related transactions for detecting fraud or money laundering, etc.). Subdue is a mining 
approach that works on a graph representation. Subdue uses the principle of minimum 
description length (MDL) to evaluate the substructures. The MDL principle has 
been used for decision tree induction [8], pattern discovery in bio-sequences [9], 
image processing [6], and concept learning from relational data. 

Main memory data mining algorithms typically face two problems with respect to 
scalability: either the graph is larger than available main memory and hence cannot be 
loaded, or the computation space required at runtime exceeds the 
available main memory. Although Subdue provides us with a tool for mining 
interesting and repetitive substructures within the data, it is a main memory algorithm. 
The algorithm constructs the entire graph as an adjacency matrix in main memory and 
then mines by iteratively expanding each vertex into larger subgraphs. 

The focus of this paper is on the development of graph mining algorithms 
(specifically the Subdue class of algorithms) using SQL and stored procedures in a 
relational database management system (RDBMS). Representation of a graph and its 
manipulation (generation of larger subgraphs, checking for exact and inexact 
matches of subgraphs) are not straightforward in SQL. They have to be cast into 
join-based operations while avoiding manipulations that are inefficient 
(correlated subqueries, cursors on large relations, in-place inserts and deletes from a 
relation). This paper elaborates on a suite of algorithms that have been carefully 
designed to achieve functionality as well as scalability. We have not only shown how 
graph mining can be mapped to relations and operations on them, but we have also 
demonstrated that scalability can be achieved without sacrificing performance. 

Section 2 provides an overview of the main memory algorithm Subdue and 
includes related work. Section 3 describes graph representation and how the 
substructures are represented and generated. Section 4 presents the relational 
approach to graph mining. Section 5 presents performance evaluation including 
comparison with the main memory algorithm. Conclusions and future work are 
presented in Section 6. 





Fig. 1. Input Graph 

Fig. 2. Input file for Subdue 



2 Related Work 

In this section we briefly describe the working of the Subdue algorithm and the 
parameters associated with it. Subdue represents the data as a labeled graph. Objects 
in the data are represented as either vertices or as small subgraphs and the 
relationships between the objects are represented as edges. For the graph shown in 
Fig. 1, the input to Subdue is a file as shown in Fig. 2, which describes the graph. 
Each vertex has a unique number and a label (not necessarily unique). Each edge has 
a label (not necessarily unique) and the numbers of the vertices that it connects, from 
source to destination. The edge can be undirected or directed. Each substructure is 
expanded in all possible ways, by adding an edge and a vertex to the instance or just 
an edge if both the vertices are already present in the instance. Some of the instances, 
which were expanded in a different way but match the substructure within a threshold 
using the inexact graph match, are also included in the instances of that substructure. 
Each of these substructures is evaluated using the MDL heuristic and only a limited 
number of substructures (specified as beam) are selected for future expansion. Beam 
represents the maximum number of substructures kept in the substructure list to be 
expanded. The algorithm’s run time is bounded by the user-specified beam width and 
the limit, which is the total number of substructures considered by the algorithm. 

The output of Subdue for the input file in Fig. 2 is a list of substructures. The best 
substructure returned by Subdue is A → B with two instances. The description length 
as calculated by MDL and the compression achieved by this substructure are also 
output. Since Subdue executes in main memory and keeps the entire graph in main 
memory, its performance degrades for larger data sets. 

Frequent Subgraphs or FSG is another approach [10] to graph mining. FSG is used 
to discover subgraphs that occur frequently in a data set of graphs. FSG uses a 
mechanism called canonical labeling [12] for graph comparison. The heuristic used 
by FSG for evaluating the substructures is count, which is the total number of 
substructures available in the whole dataset of graphs. In the candidate generation 
stage, joining two frequent k size substructures generates the k+1 size substructure 
candidates. For two k size frequent substructures to be joined they must have the same 
core, which means they must have a common k - 1 size subgraph. 

Fig. 3. Tuples in the edges relation 

Fig. 4. Tuples in the vertices relation 



3 Graph Representation and Generation of Substructures 

This section describes how graphs are represented in a database. Since databases have 
only relations, we need to convert the graph into tuples in a relation. The vertices in 
the graph are inserted into a relation called Vertices and the edges are inserted into a 
relation called Edges. The input is read from a delimited ASCII file and loaded into 
the relations. For the graph shown in Fig. 1, the corresponding edges and vertices 
relations are shown in Fig. 3 and Fig. 4, respectively. The joined_1 relation will consist of all the 
substructures of size one, where size is the number of edges. The new relation 
joined_1 is created because the edges relation does not contain information about the 
vertex labels. So the edges relation and the vertices relation are joined to get the 
joined_1 relation. The resultant joined_1 relation is shown in Fig. 5. For a single edge 
substructure, the edge direction is always from the first vertex to the second vertex. 
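
The construction of the joined_1 relation can be sketched as follows. SQLite (through Python's sqlite3 module) is used here only as a stand-in for DB2, the small graph is made up rather than taken from Fig. 1, and the column names of the vertices and edges relations (vertexno, vertexname, edgename) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schemas for the two base relations.
CREATE TABLE vertices (vertexno INTEGER PRIMARY KEY, vertexname TEXT);
CREATE TABLE edges (vertex1 INTEGER, vertex2 INTEGER, edgename TEXT);

-- A made-up graph: two A -> B edges and one B -> C edge.
INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'A'), (5, 'B');
INSERT INTO edges VALUES (1, 2, 'ab'), (2, 3, 'bc'), (4, 5, 'ab');

-- Join each edge with the labels of its source and destination vertices.
CREATE TABLE joined_1 AS
SELECT e.vertex1, e.vertex2,
       v1.vertexname AS vertex1name,
       v2.vertexname AS vertex2name,
       e.edgename    AS edge1name
FROM edges e
JOIN vertices v1 ON e.vertex1 = v1.vertexno
JOIN vertices v2 ON e.vertex2 = v2.vertexno;
""")

one_edge = conn.execute(
    "SELECT * FROM joined_1 ORDER BY vertex1").fetchall()
```

Each tuple in `one_edge` is one instance of a one-edge substructure, carrying both the vertex numbers and the labels needed for later grouping.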

For a higher edge substructure, we need to know the directionality of the edges 
between the vertices of the substructure. In the case of a two-edge substructure, the third 
vertex can be expanded from either the first vertex or the second vertex. Therefore a 
new attribute called the extension attribute (extn) has been introduced for each 
expansion of the substructure. For example, if the 5 → 6 substructure in Fig. 1 is 
extended to get the 5 → 6 → 7 substructure then the extension attribute will have 
value 2, indicating that the third vertex is extended from the second vertex. The edge 
direction can be either into the extended vertex or away from the extended vertex. 
When the edge comes into the extended vertex, it is represented by a negative value in 
the extension attribute. For example, if the 1 → 2 substructure in Fig. 1 is expanded to 
1 → 2 ← 3 then the extension attribute will have value -2, indicating that the third 
vertex is expanded from the second vertex and the edge is going towards the second 
vertex. In general, if the extension i (attribute ext) is -j, then the edge i+1 is from 
vertex i+2 to j. If the extension i is j, then the edge i+1 is from vertex j to i+2. For a 
substructure of size n, we need n-1 extension attributes. 
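
The encoding rule can be checked with a small decoder. The function below is a hypothetical helper written for illustration; it takes the vertex numbers of a substructure tuple and its extension attributes and returns the directed edges they describe.

```python
def decode_edges(vertices, extensions):
    """Recover directed edges from the extension attributes.

    vertices: vertex numbers (vertex1, vertex2, ..., vertex n+1) of one tuple.
    extensions: values ext1 .. ext(n-1) for a substructure with n edges.
    Returns a list of (source, destination) pairs.
    """
    # The first edge always goes from the first to the second vertex.
    edges = [(vertices[0], vertices[1])]
    for i, ext in enumerate(extensions, start=1):
        new_vertex = vertices[i + 1]     # vertex i+2 in 1-based numbering
        anchor = vertices[abs(ext) - 1]  # vertex |ext| in 1-based numbering
        if ext > 0:
            edges.append((anchor, new_vertex))  # edge leaves the anchor
        else:
            edges.append((new_vertex, anchor))  # edge enters the anchor
    return edges
```

For the two examples in the text, `decode_edges((5, 6, 7), [2])` gives the edges 5 → 6 and 6 → 7, while `decode_edges((1, 2, 3), [-2])` gives 1 → 2 and 3 → 2.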

Fig. 5. Joined_1 relation 



4 Approach 

The subgraph discovery process using relations uses the above representation for the 
graph. We start with a one-edge substructure and compute a count for each tuple (or 
substructure) to indicate the number of substructures it is isomorphic to (either an exact 
match or an inexact match using a threshold). This is done using the GROUP BY clause 
in DB2 [11] to update the count (number of instances) of each substructure. Based on 
count as a heuristic, the substructures are ranked. Substructures are then expanded to 
two-edge substructures and the process is repeated until we find the best n-sized 
substructures. 



4.1 Algorithm 

The pseudocode for this algorithm is given below: 

Subdue-DB (input file, size) 
    Load vertices into vertices relation; 
    Load edges into edges relation; 
    Load joined_base relation and joined_1 relation by joining 
        vertices and edges relation 
    i = 2 
    WHILE (i < size) 
        Load joined_i relation (substructures of size i) 
            from beam_joined_i-1, joined_base 
        Create frequent_i relation 
        DECLARE Cursor c1 on frequent_i order by count 
        DECLARE Cursor c2 on frequent_i 
        WHILE (c1.count < beam) 
            FETCH c1 into g1 
            WHILE (c2) 
                FETCH c2 into g2 
                IF (!Isomorphic(g1, g2) = 0) 
                    Insert c1 into frequent_beam_i 
        Insert into beam_joined_i 
            from frequent_beam_i, joined_i 
        i++ 



4.2 Substructure Discovery 

The substructure discovery algorithm starts with one-edge substructures, unlike 
Subdue, which starts with all the unique vertex labels. In the database version, each 
instance of the substructure is represented as a tuple in the relation. The algorithm 
starts by initializing the vertices and the edges relations. The joined_base relation is 
then constructed by joining the vertices and edges relations. The joined_base relation 
will be used for future substructure expansion. The joined_1 relation is a copy of the 
joined_base relation. The frequent_1 relation is created to keep track of substructures 
of size 1 and their counts. Here, size refers to the number of edges, as opposed to the 
Subdue main memory algorithm in which size refers to the number of vertices. 
Projecting the vertex label and edge label attributes of the joined_1 relation and 
then grouping on the same attributes produces the count for the frequent_1 
relation as shown in Fig. 6. 

Fig. 6. Frequent_1 relation 



Therefore the frequent_1 relation does not have vertex numbers. The GROUP BY clause 
is used so that all the exact instances of the substructure are grouped as one tuple with 
their count updated. 

The pruning of substructures with a single instance corresponds to deleting the 
tuples with count value 1 from the frequent_1 relation. Since these substructures do 
not contribute to larger repeated substructures, they are deleted from the joined_base 
relation as well, so that further expansions using that relation do not produce 
substructures with single instances (single-instance substructures may still be produced 
later when repeated substructures of length i do not grow to repeated substructures of 
length i + 1). This is based on the property that the number of instances of a larger 
substructure cannot be more than the number of instances of a smaller structure 
embedded in it. This allows for pruning of substructures that will not produce larger 
substructures of significance. This property is similar to the subset property used in 
association rule mining. 

Fig. 7. Updated joined_base relation 

The updated frequent_1 relation contains the first two rows of the frequent_1 relation 
shown in Fig. 6, and the resultant joined_base relation is shown in Fig. 7. The 
substructures are then sorted in decreasing order of the count attribute and the beam 
number of substructures is inserted into another relation called the frequent_beam_1 
relation. Only the substructures present in this relation are expanded to larger 
substructures. Since the instances of these substructures are not present in the 
frequent_beam_1 relation, this relation is joined with the joined_1 relation to 
construct the joined_beam_1 relation. The joined_beam_1 relation is in turn joined with the 
joined_base relation to generate the two-edge substructures. 
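
The counting, pruning, and beam selection steps for one-edge substructures can be sketched together. As before, SQLite stands in for DB2 and the tuples are made up; the count column is named cnt here, and the beam value of 1 is chosen only to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE joined_1 (vertex1 INTEGER, vertex2 INTEGER,
                       vertex1name TEXT, vertex2name TEXT, edge1name TEXT);
INSERT INTO joined_1 VALUES
    (1, 2, 'A', 'B', 'ab'),
    (2, 3, 'B', 'C', 'bc'),
    (4, 5, 'A', 'B', 'ab'),
    (5, 6, 'B', 'C', 'bc'),
    (5, 7, 'B', 'C', 'bc'),
    (7, 8, 'C', 'D', 'cd');

-- Count instances per label pattern; vertex numbers are projected away.
CREATE TABLE frequent_1 AS
    SELECT vertex1name, vertex2name, edge1name, COUNT(*) AS cnt
    FROM joined_1
    GROUP BY vertex1name, vertex2name, edge1name;

-- Prune substructures that occur only once.
DELETE FROM frequent_1 WHERE cnt = 1;

CREATE TABLE frequent_beam_1 (vertex1name TEXT, vertex2name TEXT,
                              edge1name TEXT, cnt INTEGER);
""")

beam = 1  # illustrative beam width
conn.execute(
    "INSERT INTO frequent_beam_1 "
    "SELECT * FROM frequent_1 ORDER BY cnt DESC LIMIT ?", (beam,))

frequent = conn.execute(
    "SELECT * FROM frequent_1 ORDER BY vertex1name").fetchall()

# Recover the instances of the beam substructures by joining on the labels.
beam_instances = conn.execute("""
    SELECT j.vertex1, j.vertex2
    FROM frequent_beam_1 f
    JOIN joined_1 j
      ON j.vertex1name = f.vertex1name
     AND j.vertex2name = f.vertex2name
     AND j.edge1name = f.edge1name
    ORDER BY j.vertex1, j.vertex2""").fetchall()
```

Here the C → D pattern is pruned because it has a single instance, and the B → C pattern, with three instances, is the only one kept in the beam.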



4.3 Extending to Higher-Edge Substructures 

In the main memory approach, every substructure, which is necessarily a subgraph, is 
defined as a structure in the programming language. Extensions to two or more edges 
are generated by growing the substructure appropriately. In the database 
representation, as there are no structures, the only information to be used will be the 
single edge substructures, which are basically tuples in the joined_base relation. The 
number of attributes of the relation needs to be increased to capture substructures of 
increased size. The joined_beam_1 relation contains all the instances of the single 
edge substructure. Each single edge substructure can be expanded to a two-edge 
substructure on either of the two vertices in the edge. In general, an n edge substructure 
can be expanded on n + 1 vertices in the substructure. So by making a join with the 
joined_base relation we can always extend a given substructure by one edge. In order 
to make an extension, one of the vertices in the substructure has to match a vertex in 
the joined_base relation. The resultant tuples are stored in the joined_2 relation. The 
following query extends a single edge substructure in all possible ways. 

Insert into joined_2 (vertex1, vertex2, vertex3, 
        vertex1name, vertex2name, vertex3name, 
        edge1name, edge2name, ext1) 
(Select j1.vertex1, j1.vertex2, j2.vertex2, 
        j1.vertex1name, j1.vertex2name, 
        j2.vertex2name, j1.edge1name, j2.edge1name, 1 
 from joined_1 j1, joined_base j2 
 where j1.vertex1 = j2.vertex1 and 
       j1.vertex2 != j2.vertex2 
 Union 
 Select j1.vertex1, j1.vertex2, j2.vertex2, 
        j1.vertex1name, j1.vertex2name, 
        j2.vertex2name, j1.edge1name, j2.edge1name, 2 
 from joined_1 j1, joined_base j2 
 where j1.vertex2 = j2.vertex1 
 Union 
 Select j.vertex1, j.vertex2, j1.vertex1, 
        j.vertex1name, j.vertex2name, j1.vertex1name, 
        j.edge1name, j1.edge1name, -2 
 from joined_1 j, joined_base j1 
 where j.vertex2 = j1.vertex2 and 
       j.vertex1 != j1.vertex1) 

In the above there are 3 queries (and 2 unions). Since there are 2 nodes (in a 
one-edge substructure), the first query corresponds to the positive extension of the first 
node, and the other two correspond to the positive and negative extensions of the 
second node. In general, for a substructure with n nodes, there will be (n-1)*2 + 1 
queries and (n-1)*2 unions. The reason for not performing the negative extension on 
the first node is that it will be covered by the positive expansion of the node whose 
edge is coming into it. Also, every node is a starting point for expanding it to a larger 
size substructure. Hence the incoming edges of the starting node need not be 
considered. The rest of the steps are the same as the single edge substructure 
expansion, except for the addition of a new extension attribute. The n-1 edge 
substructures are stored in the joined_beam_n-1 relation. In order to expand to n edge 
substructures, an edge is added to the existing n-1 edge substructure. Therefore, the 
joined_beam_n-1 relation is joined with the joined_base relation to add an edge to the 
substructure. The frequent_n relation is in turn generated from the joined_n relation. 
The frequent_beam_n relation contains the beam tuples. The joined_beam_n relation is then 
generated, which has only the instances of the beam tuples. The main halting condition 
for the algorithm is the user-specified parameter, the max size (limit). Once the 
algorithm discovers all the substructures of the max size, the program terminates. 
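
The three-way union for extending one-edge substructures can be tried on a toy graph. This sketch again uses SQLite in place of DB2, a made-up set of four edges, and the column names from the query in this section; it is meant only to show which two-edge tuples each branch contributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE joined_base (vertex1 INTEGER, vertex2 INTEGER,
                          vertex1name TEXT, vertex2name TEXT, edge1name TEXT);
-- Made-up edges: 1 -> 2, 2 -> 3, 4 -> 2, 1 -> 5.
INSERT INTO joined_base VALUES
    (1, 2, 'A', 'B', 'ab'),
    (2, 3, 'B', 'C', 'bc'),
    (4, 2, 'D', 'B', 'db'),
    (1, 5, 'A', 'E', 'ae');
CREATE TABLE joined_1 AS SELECT * FROM joined_base;

CREATE TABLE joined_2 (vertex1 INTEGER, vertex2 INTEGER, vertex3 INTEGER,
                       vertex1name TEXT, vertex2name TEXT, vertex3name TEXT,
                       edge1name TEXT, edge2name TEXT, ext1 INTEGER);

INSERT INTO joined_2
-- Positive extension of the first vertex (ext1 = 1).
SELECT j1.vertex1, j1.vertex2, j2.vertex2,
       j1.vertex1name, j1.vertex2name, j2.vertex2name,
       j1.edge1name, j2.edge1name, 1
FROM joined_1 j1, joined_base j2
WHERE j1.vertex1 = j2.vertex1 AND j1.vertex2 != j2.vertex2
UNION
-- Positive extension of the second vertex (ext1 = 2).
SELECT j1.vertex1, j1.vertex2, j2.vertex2,
       j1.vertex1name, j1.vertex2name, j2.vertex2name,
       j1.edge1name, j2.edge1name, 2
FROM joined_1 j1, joined_base j2
WHERE j1.vertex2 = j2.vertex1
UNION
-- Negative extension of the second vertex (ext1 = -2).
SELECT j.vertex1, j.vertex2, j1.vertex1,
       j.vertex1name, j.vertex2name, j1.vertex1name,
       j.edge1name, j1.edge1name, -2
FROM joined_1 j, joined_base j1
WHERE j.vertex2 = j1.vertex2 AND j.vertex1 != j1.vertex1;
""")

two_edge = conn.execute(
    "SELECT vertex1, vertex2, vertex3, ext1 FROM joined_2 "
    "ORDER BY vertex1, vertex2, vertex3").fetchall()
```

On this toy graph each branch contributes two tuples, so `two_edge` holds six two-edge instances. Note that the same pair of edges can appear under more than one vertex ordering, for example (1, 2, 5) and (1, 5, 2).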




Fig. 8. Embedded Substructure 

5 Performance Evaluation of Implementation Alternatives 

This section discusses the performance comparison between DB-Subdue and the Subdue 
main memory algorithm for various datasets. Several optimizations were done before 
the final approach was implemented. This process was very useful in understanding 
the effect of various types of query formulations (correlated subqueries, using the except or 
minus operator, use of indexes, etc.). We use DB2 6.1, running on a SUNW, Ultra-5_10 5.6 
with 348MB of RAM. The embedded substructure in the graph on which the 
performance comparison will be made is shown in Fig. 8. Each experiment was 
performed 4 times and the first result was discarded (so that cold start is avoided). The 
numbers presented show the average of the next 3 runs. Each data set is named 
TnVmE, where n is the number of vertices in the graph and m is the number of edges 
in the graph. 

The graphical comparisons between Subdue and DB-Subdue are shown in Fig. 9 and 
Fig. 10 for beam values of 4 and 10, respectively. The timings are shown on a 
logarithmic scale, as the range is very large. The final timings and their comparison 
with Subdue are shown. The numbers of instances of the substructures discovered by 
all the algorithms are the same. The DB-Subdue approach was applied to graphs that had 
800K vertices and 1600K edges. The main memory algorithm would not go beyond 
50K vertices and 100K edges. The performance crossover point for the algorithm 
takes place at as low as 100 edges in the graph (which was a surprise!). The main 
memory algorithm took more than 60,000 seconds to initialize the T400KV1600KE 
graph and took more than 20 hours to initialize the T800KV1600KE data set. Testing 
was not done beyond the T800KV1600KE data set for the database approach because 
the graph generator could not produce the required graph. We also ran the same data 
set with beam size set to 10. The run time for the data set T800KV1600KE using the 
database approach with a beam size of 10 is 12,703 seconds. 

The time taken by the Subdue algorithm for a data set of T50KV100KE and a 
beam size of 10 is 71,192 seconds. One of the main reasons for such improvement has 
been using pure SQL statements to achieve the functionality. The database approach 
is also less sensitive to the beam size than Subdue. For the largest data set it can handle, 
Subdue takes 34,259 seconds with beam=4 and 71,192 seconds with beam=10. The 
time taken by the database approach does not increase with the beam at the same rate 
as that of the Subdue algorithm. This is a significant improvement from a 
scalability viewpoint. Also, the absolute numbers are not important here, as the 
experiments have been done on an old machine. 

It took us several iterations before arriving at the final version of the algorithm. 
Initial naive versions of DB-Subdue performed worse than the main memory 
algorithm. We eliminated correlated subqueries and used the minus operator, and changed 
the approach from deleting the tuples (which were large in number) to retaining the 
tuples that we wanted to expand (small in number). These 
optimizations have resulted in orders of magnitude of difference in computation time, 
allowing us to scale to very large data sets. 

Fig. 9. Graphical comparison of the approaches 

Fig. 10. Graphical comparison of the approaches 



6 Conclusion and Future Work 

We believe that this is the first attempt to mine substructures over graphs using the 
database approach. The idea behind graph-based data mining is to find interesting and 
repetitive substructures in a graph. One of the main challenges was the representation 
and generation of substructures in a database. The next challenge was to optimize the 
approach to achieve performance superiority over the main memory version. Finally, 
we were able to run graph mining on data sets that have 800K vertices and 1600K 
edges to achieve the desired functionality and scalability. 

Currently, DB-Subdue is being extended in several ways: to handle cycles in a graph, 
to include the concept of overlap among substructures (where a smaller subgraph is 
common between two subgraphs), to support inexact graph match, and to develop the equivalent of 
MDL for the database approach. 

Abstract. Many electronic documents such as SGML/HTML/XML 
files and LaTeX files have tree structures. Such documents are called 
tree-structured documents. Many tree-structured documents contain large 
plain texts. In order to extract structural features among words from 
tree-structured documents, we consider the problem of finding frequent 
structured patterns among words in tree-structured documents. Let 
$k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted 
in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ 
is a sequence $(t_1; t_2; \ldots; t_{k-1})$ of labeled rooted ordered trees such that, 
for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair 
$(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are 
one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a 
data mining algorithm for finding all frequent consecutive path patterns 
in tree-structured documents. Then, by reporting experimental results 
on our algorithm, we show that our algorithm is efficient for extracting 
structural features from tree-structured documents. 



1 Introduction 

Background: Many electronic documents such as SGML/HTML/XML files 
and LaTeX files have tree structures but no rigid structure. Such documents 
are called tree-structured documents. Since many tree-structured documents 
contain large plain texts, we focus on characteristics such as the 
usage of words and the structural relations among words in tree-structured documents. 
The aim of this paper is to present an efficient data mining technique 
for extracting interesting structures among words in tree-structured documents. 

Data model and consecutive path pattern: As a data model for tree-structured 
documents, we use an Object Exchange Model (OEM, for short) 
presented by Abiteboul et al. in [1]. As usual, a tree-structured document is 
represented by a labeled ordered rooted tree, which is called an OEM tree. As a 
value, each node of an OEM tree has a string such as a tag in HTML/XML files, 
or a text such as a text written in the field of PCDATA in XML files. Moreover, 
an internal node has children which are ordered. For example, we give an XML 
file xmlsample and its OEM tree T in Fig. 1. In T, the node 3 has “TOPICS” 
as a value and is the 2nd child of the node 1. 

<REUTERS> 
  <DATE> 26-FEB-1987 </DATE> 
  <TOPICS> <D> cocoa </D> </TOPICS> 
  <TITLE> BAHIA COCOA REVIEW </TITLE> 
  <DATELINE> SALVADOR, Feb 26 - </DATELINE> 
  <BODY> 
    Showers continued throughout the week 
    in the Bahia cocoa zone, ... 
  </BODY> 
</REUTERS> 

Fig. 1. An XML file xmlsample and the OEM tree T of xmlsample. 

Many tree-structured documents have no absolute schema fixed in advance, 
and their structures may be irregular or incomplete. The formalization of 
representing knowledge is important for finding useful knowledge. For an integer 
$k \geq 2$, let $(W_1, W_2, \ldots, W_k)$ be a list of $k$ words appearing in a given set of 
tree-structured documents such that words are sorted in ASCII-based 
lexicographical order. Then, a consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ (a 
CPP, for short) is a sequence $(t_1; t_2; \ldots; t_{k-1})$ of labeled ordered rooted trees 
such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the 
pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one 
and which are labeled with $W_i$ and $W_{i+1}$, respectively. A CPP $\alpha = (t_1; \ldots; t_k)$ 
is said to appear in the OEM tree $T_d$ of a given tree-structured document $d$, if 
there exists a sequence $(s_1, \ldots, s_k)$ of subtrees $s_1, \ldots, s_k$ of $T_d$ satisfying 
the following conditions (1) and (2). (1) For each $1 \leq i \leq k$, $s_i$ is isomorphic 
to the 2-tree $t_i$ in $\alpha$. (2) For each $1 \leq j \leq k-1$, the righthand leaf of $s_j$ and the 
lefthand leaf of $s_{j+1}$ are the same in $T_d$. For example, the sequences $\alpha = (t_1; t_2; t_3)$ 
and $\beta = (t_1; t_2; t_4)$ consisting of 2-trees $t_1$, $t_2$, $t_3$ and $t_4$ given in Fig. 2 are CPPs 
on (BAHIA, SALVADOR, cocoa, week). We can see that $\alpha$ appears in T but $\beta$ does 
not appear in T, where T is the labeled rooted tree given in Fig. 1. 

Fig. 2. 2-trees $t_1$, $t_2$, $t_3$ and $t_4$ 

Main results: We consider the problem of finding all CPPs which appear in
a given set of tree-structured documents with a frequency of at least a
user-specified threshold, which is called a minimum support. For this problem, we
present an efficient algorithm based on a level-wise search strategy with
respect to the length of a CPP: like the Apriori algorithm [2], it makes new CPPs
from the CPPs already found. Next, in order to show the efficiency of our algorithm,
we apply it to Reuters news-wires [9], which contain 21,578 SGML
documents and whose total size is 28MB. By reporting experimental results,
we show that our algorithm has good performance on a
large set of tree-structured documents.

Related works: In [6], we presented an algorithm for extracting all frequent
CPPs from a given set of tree-structured documents. This algorithm was
designed under the following idea. If a set $I$ of words is not a $\sigma$-frequent itemset
w.r.t. $D$, then any CPP which contains all words in $I$ is not frequent. This idea
reduces the search space of possible representative CPPs. However,
when there are many highly frequent and large itemsets in tree-structured
documents, this idea does not prune the search space well. Hence, in
this paper, we present a new efficient algorithm, based on a level-wise
search strategy, for solving the Frequent CPP Problem.

As a data mining method for unstructured texts, Fujino et al. [5] presented an
efficient algorithm for finding all frequent phrase association patterns in a large
collection of unstructured texts, where a phrase association pattern is a set of
consecutive sequences of keywords which appear together in a document. Several
data mining methods have been proposed for discovering frequent schemas in
semistructured data. For example, Wang and Liu [11] presented an algorithm for
finding frequent tree-expression patterns and Fernandez and Suciu [4] presented
an algorithm for finding optimal regular path expressions from semistructured
data. In [3], Asai et al. presented an efficient algorithm for discovering frequent
substructures from a large collection of semistructured data. In [10], Miyahara
et al. presented an algorithm for extracting all frequent maximally structural
relations between substructures from tree-structured documents.

In order to discover characteristic schemas, many studies, including the above
results, have focused on tags and their structural relations appearing in semistructured
data. In this paper, however, we are more interested in words and their structural
relations than in tags in tree-structured documents. The algorithm given in this
paper is useful for filtering out inappropriate documents and for finding documents
that are interesting to users.

2 Preliminaries 

In this section, in order to represent structural features among words in tree- 
structured documents, we formally define a consecutive path pattern as con- 
secutive paths between words. In this paper, we deal with only tree-structured 
documents. Therefore, we simply call a tree-structured document a document. 

An alphabet is a finite set of symbols and is denoted by $\Sigma$. We assume that
$\Sigma$ includes the space symbol “ ”. A finite sequence $(a_1, a_2, \ldots, a_n)$ of symbols
in $\Sigma$ is called a string and it is denoted by $a_1 a_2 \cdots a_n$ for short. A word is a
substring $a_2 a_3 \cdots a_{n-1}$ of $a_1 a_2 \cdots a_n$ over $\Sigma$ such that both $a_1$ and $a_n$ are space
symbols and each $a_i$ ($i = 2, 3, \ldots, n-1$) is a symbol in $\Sigma$ which is not the
space symbol. Let $T_d$ be the OEM tree of a document $d$. For a word $w$, $JS_d(w)$
denotes the set of all nodes of $T_d$ whose value contains $w$ as a word. Moreover,
let $D = \{d_1, d_2, \ldots, d_m\}$ be a set of documents and $T_i$ ($i = 1, 2, \ldots, m$) the
OEM tree of the document $d_i$ in $D$. For a word $w$, $JS_D(w) = \bigcup_{1 \le i \le m} JS_{d_i}(w)$
is called an appearance node set of $w$ in $D$. For a set or a list $S$, the number of
elements of $S$ is denoted by $\#S$.
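The definitions of a word and of an appearance node set can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: each OEM tree is encoded as a hypothetical dictionary from node ids to node values (the tree edges are ignored here), and words are split on any whitespace rather than on the single space symbol.

```python
from collections import defaultdict


def words_of(value):
    """Split a node value into words, i.e. maximal runs of non-space symbols."""
    return set(value.split())


def appearance_node_sets(documents):
    """Map each word w to the set of (document index, node id) pairs whose
    node value contains w as a word.

    `documents` is a list of dicts {node_id: value}, one dict per OEM tree.
    This encoding is an assumption made for the example only.
    """
    js = defaultdict(set)
    for doc_index, nodes in enumerate(documents):
        for node_id, value in nodes.items():
            for w in words_of(value):
                js[w].add((doc_index, node_id))
    return dict(js)


# Two toy documents with hypothetical node numbering.
docs = [
    {1: "REUTERS", 2: "DATE", 3: "TOPICS", 4: "D", 5: "cocoa"},
    {1: "REUTERS", 2: "TITLE", 3: "BAHIA COCOA REVIEW"},
]
js = appearance_node_sets(docs)
```

Note that, as in the ASCII-based setting of the paper, "cocoa" and "COCOA" are treated as different words.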

Let $W$ and $W'$ be two words over $\Sigma$. We call an ordered rooted tree $t$ a 2-tree
over the pair $(W, W')$ if (1) $t$ consists of only one node labeled with $(W, W')$ or
(2) $t$ has just two nodes whose degrees are one and which are labeled with $W$
and $W'$, respectively. For a 2-tree $t$, the lefthand node and the righthand node of
$t$ whose degrees are one are denoted by $ln(t)$ and $rn(t)$, respectively. If $t$ consists
of only one node, we remark that $ln(t) = rn(t)$. For a 2-tree $t$ and a document
$d$, a matching function of $t$ for $T_d$ is any function $\pi : V_t \to V_{T_d}$ that satisfies the
following conditions (1)-(3), where $T_d$ is the OEM tree of $d$, and $V_t$ and $V_{T_d}$ are
the node sets of $t$ and $T_d$, respectively.

(1) $\pi$ is a one-to-one mapping. That is, for any $v_1, v_2 \in V_t$, if $v_1 \neq v_2$ then
$\pi(v_1) \neq \pi(v_2)$.

(2) $\pi$ preserves the parent-child relation. That is, for the edge sets $E_t$ and
$E_{T_d}$ of $t$ and $T_d$, respectively, $(v_1, v_2) \in E_t$ if and only if $(\pi(v_1), \pi(v_2)) \in E_{T_d}$.

(3) If $t$ has two leaves, the label of each node $v \in V_t$ is contained in the value of
the node $\pi(v)$ in $V_{T_d}$ as a word. If $t$ consists of only one node $v$, the two words
in the label of $v$ appear individually in the value of $\pi(v)$ in $V_{T_d}$.

Moreover, a pseudo-matching function from $t$ to $T_d$ is any function $\psi : V_t \to V_{T_d}$
that satisfies the above conditions (1) and (2) and the following condition (3').

(3') If $t$ has two leaves, the label of each leaf $v \in V_t$ is contained in the value
of $\psi(v)$ in $V_{T_d}$ as a word. If $t$ consists of only one node $v$, the two words in the
label of $v$ appear individually in the value of $\psi(v)$ in $V_{T_d}$.

That is, in the definition of a pseudo-matching function, we ignore the label of any
inner node in $t$. A 2-tree $t$ is said to be representative if any inner node of $t$ has
a special symbol, which is not a word, as a label.

In this paper, we assume a total order over words such as an ASCII-based
lexicographical order. For two words $W$ and $W'$, if $W$ is less than $W'$ in this order,
we denote $W < W'$. Let $k$ be an integer greater than 1 and $(W_1, W_2, \ldots, W_k)$ a
list of $k$ words $W_1, W_2, \ldots, W_k$ such that $W_i < W_{i+1}$ for $1 \le i \le k-1$. Then,
a list $(t_1; t_2; \ldots; t_{k-1})$ of $k-1$ consecutive 2-trees $t_1, t_2, \ldots$ and $t_{k-1}$ is called a
consecutive path pattern for $(W_1, W_2, \ldots, W_k)$ if for any $1 \le i \le k-1$, the
$i$-th 2-tree $t_i$ is a 2-tree over $(W_i, W_{i+1})$. We remark that, for each $1 \le i \le k-2$,
both the righthand node $rn(t_i)$ of $t_i$ and the lefthand node $ln(t_{i+1})$ of $t_{i+1}$
have the same label. A consecutive path pattern $(t_1; t_2; \ldots; t_{k-1})$ is said to be
representative if for any $1 \le i \le k-1$, the $i$-th 2-tree $t_i$ is a representative 2-tree
over $(W_i, W_{i+1})$. When the $k$ words need not be specified, $\alpha$ is simply called a
consecutive path pattern, or a CPP for short. The number of 2-trees in a CPP $\alpha$
is called the length of $\alpha$.

In the same way as for a 2-tree, for a consecutive path pattern $\alpha = (t_1; t_2; \ldots; t_k)$
and a document $d$, we can define a matching function $\pi : \bigcup_{1 \le i \le k} V_i \to V_{T_d}$ as
follows, where $T_d$ is the OEM tree of $d$, $V_{T_d}$ is the node set of $T_d$ and $V_i$ is the
node set of $t_i$ for $1 \le i \le k$.

(1) For any $1 \le i \le k$, there exists a matching function $\pi_i : V_i \to V_{T_d}$ such that
for a node $v$ in $t_i$, $\pi(v) = \pi_i(v)$.

(2) For any $1 \le i \le k-1$, the node $rn(t_i)$ in $t_i$ and the node $ln(t_{i+1})$ in $t_{i+1}$
satisfy $\pi(rn(t_i)) = \pi(ln(t_{i+1}))$.

Moreover, we also define a pseudo-matching function from a representative
CPP $\alpha$ to $T_d$ by replacing each matching function $\pi_i$ with a pseudo-matching
function $\psi_i$.

For a CPP $\alpha = (t_1; t_2; \ldots; t_k)$ and a document $d$, the set $\{(\pi, \pi(rn(t_k))) \mid$
$\pi$ is a matching function of $\alpha$ for $T_d\}$ is called the junction set of $\alpha$ in $d$ and
is denoted by $JS_d(\alpha)$, where $T_d$ is the OEM tree of $d$. If $\#JS_d(\alpha) \ge 1$,
we say that $\alpha$ appears in $d$. In the same way, for a representative CPP $\alpha'$
and a document $d$, the junction set $JS_d(\alpha')$ is the set $\{(\pi', \pi'(rn(t_k))) \mid$
$\pi'$ is a pseudo-matching function of $\alpha'$ for $T_d\}$. If $\#JS_d(\alpha') \ge 1$, we say that
$\alpha'$ appears in $d$. For example, the four 2-trees $t_1$, $t_2$, $t_3$ and $t_4$ in Fig. 2 appear in the
document xml_sample in Fig. 1. The CPP $\alpha = (t_1; t_2; t_3)$ appears in xml_sample.
In contrast, the CPP $(t_1; t_2; t_4)$ does not appear in xml_sample, since there is no
function $\pi$ such that $\pi(rn(t_2)) = \pi(ln(t_4))$ and $\pi$ is a matching function of both
$t_2$ and $t_4$ for T in Fig. 1.

3 Extracting Consecutive Path Patterns from 
Tree-Structured Documents 

In this section, we formally define the data mining problem of finding all frequent
consecutive path patterns from tree-structured documents. Next, we give an
efficient algorithm for solving this problem.

For a set $D = \{d_1, d_2, \ldots, d_m\}$ of documents, a word $w$ and a CPP $\alpha$, let
$Occ_D(w) = \#\{i \mid 1 \le i \le m, JS_{d_i}(w) \neq \emptyset\}$ and let
$Occ_D(\alpha) = \#\{i \mid 1 \le i \le m, JS_{d_i}(\alpha) \neq \emptyset\}$. For a set $D$ of $m$ documents and a real number $\sigma$
($0 < \sigma \le 1$), a word $w$ and a CPP $\alpha$ are $\sigma$-frequent w.r.t. $D$ if $Occ_D(w)/m \ge \sigma$
and $Occ_D(\alpha)/m \ge \sigma$, respectively. In general, the real number $\sigma$ is given by a
user and is called a minimum support. For a tree-structured document $d$, $W(d)$
denotes the set of all words appearing in $d$. For a set $D$ of documents, a CPP
w.r.t. $D$ is a CPP on a list of words in $W(D) = \bigcup_{d \in D} W(d)$. Then, we consider
the following data mining problem.

Frequent (Representative) CPP Problem

Instance: A set $D$ of tree-structured documents and a minimum support
$0 < \sigma \le 1$.

Problem: Find all $\sigma$-frequent (representative) CPPs w.r.t. $D$.
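As a small illustration of the frequency condition used in this problem, the following Python sketch selects the frequent words of a toy document set. It assumes that each document has already been reduced to its set of words and that the comparison with the minimum support is "greater than or equal"; the function names are hypothetical and are not taken from the paper.

```python
def occ(doc_word_sets, word):
    """Number of documents whose word set contains `word`."""
    return sum(1 for words in doc_word_sets if word in words)


def frequent_words(doc_word_sets, sigma):
    """Return the words w whose document frequency occ(w) / m is at least sigma."""
    m = len(doc_word_sets)
    vocabulary = set().union(*doc_word_sets)
    return {w for w in vocabulary if occ(doc_word_sets, w) / m >= sigma}


# Four toy documents, each given as its set of words.
D = [
    {"cocoa", "week", "the"},
    {"cocoa", "the"},
    {"week", "the"},
    {"cocoa", "the"},
]
```

The same test applies to a CPP once the documents in which it appears are known; only the way of deciding "appears in document d" differs.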

In [6], we presented an Apriori-based algorithm for solving the Frequent
Representative CPP Problem. This algorithm was designed under the following
idea. If a set $I$ of words is not a $\sigma$-frequent itemset w.r.t. $D$, then any
representative CPP which contains all words in $I$ is not frequent. This algorithm finds
all $\sigma$-frequent itemsets w.r.t. $D$ by regarding a word as an item and the set $W(d)$
of all words appearing in each document $d$ as a transaction, and by using an FP-tree,
which is a compact data structure given by Han et al. in [8]. This idea
reduces the search space of possible representative CPPs. However,
when there are many highly frequent and large itemsets in documents, this idea
does not prune the search space well. Hence, in this paper, we present
a new efficient algorithm, based on a level-wise search strategy, for solving
the Frequent CPP Problem. Moreover, as a heuristic for extracting
interesting patterns from given tree-structured documents and for reducing the search
space, a very frequent word such as “a” or “the” is not used as a word in a
CPP. Such a word is called a Stop Word.

In Fig. 3, we present an algorithm Find_Freq_CPP for solving the Frequent CPP
Problem. Given a set $D$ of tree-structured documents,
a minimum support $0 < \sigma \le 1$ and a set of Stop Words as input, our algorithm outputs
the set of all $\sigma$-frequent CPPs w.r.t. $D$. Our algorithm is based on a
level-wise search strategy. A CPP consists of consecutive 2-trees. For any two words
$w_1$, $w_2$ in a document $d$, there exists just one path from the node of $T_d$ whose
value contains $w_1$ to the node of $T_d$ whose value contains $w_2$, where $T_d$ is the
OEM tree of $d$. Hence, by using the junction sets of all frequent CPPs whose length
is $k \ge 1$ and the appearance node sets of all frequent words, we can correctly obtain
all frequent CPPs whose length is $k+1$ and their junction sets from all frequent
CPPs whose length is $k$.
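The observation that two nodes of an OEM tree are joined by exactly one path can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes the tree is given by a hypothetical `parent` dictionary and it returns node ids only, leaving the construction of the labeled 2-tree aside.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def path_between(parent, x, y):
    """Return the unique path from node x to node y in a rooted tree.

    `parent` maps each node to its parent (None for the root).
    """
    up_x = path_to_root(parent, x)
    up_y = path_to_root(parent, y)
    ancestors_of_x = set(up_x)
    # The first ancestor of y that is also an ancestor of x is their
    # lowest common ancestor.
    lca = next(v for v in up_y if v in ancestors_of_x)
    climb = up_x[: up_x.index(lca) + 1]      # x ... lca
    descend = up_y[: up_y.index(lca)][::-1]  # child of lca ... y
    return climb + descend


# Hypothetical numbering: 1 = root, 2, 3, 5, 6, 7 = children of 1, 4 = child of 3.
parent = {1: None, 2: 1, 3: 1, 4: 3, 5: 1, 6: 1, 7: 1}
```

When x and y are the same node, the path consists of that single node, which corresponds to the one-node case of a 2-tree.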

While reading the given documents, we construct a trie (see [7]), which is a compact
data structure, for efficiently storing and searching words together with their
appearance node sets. By using this structure, line 1 of our algorithm can be
executed in linear time. Let $M$ and $N$ be the numbers of $\sigma$-frequent words in
$D$ and of the extracted $\sigma$-frequent CPPs, respectively. Let $n$ be the number of

Algorithm Find_Freq_CPP

Input: A set D = {d_1, d_2, ..., d_m} of tree-structured documents,
       a minimum support 0 < σ ≤ 1 and a set Q of stop words.

Output: The set F of all σ-frequent CPPs w.r.t. D.

1. Freq_Word := {(w, JS_D(w)) | w ∈ W(D) − Q, w is σ-frequent w.r.t. D};
2. F[1] := Make_2Tree(Freq_Word, σ);
3. k := 1;
4. while F[k] ≠ ∅ do
5. begin
6.     F[k + 1] := Expand_Pattern(F[k], Freq_Word, σ);
7.     k := k + 1;
8. end; /* end of while loop */
9. return F = ∪_{1≤i≤k} {α | (α, JS_D(α)) ∈ F[i]};

Fig. 3. The data mining algorithm Find_Freq_CPP



Procedure Make_2Tree

Input: A set Freq_Word of pairs consisting of words and the lists of nodes,
       and a minimum support 0 < σ ≤ 1.

Output: The set F = {((t), JS_D((t))) | (t) is a σ-frequent CPP w.r.t. D}.

1. for each (W, list_W) ∈ Freq_Word do
2.   for each (P, list_P) ∈ Freq_Word such that W < P do
3.     tmp := ∅;
4.     for each node x ∈ list_W and each node y ∈ list_P do
5.       search a 2-tree t over the pair (W, P) from x to y,
         if both x and y are in the same tree;
6.       execute the following instruction (1) or (2);
         (1) add y to list if there exists (t, list) in tmp such that y ∉ list;
         (2) add (t, y) to tmp if t ∉ {s | (s, L) ∈ tmp};
7.     end do
8.     if (#Occ_D((t)))/(#D) ≥ σ then F := F ∪ {((t), list) | (t, list) ∈ tmp};
9.   end do; /* end of inner for loop */
10. end do; /* end of outer for loop */
11. return F;

Fig. 4. The procedure Make_2Tree



nodes of all OEM trees each of which corresponds to a document in $D$. Then,
Make_2Tree and Expand_Pattern are executable in $O(n^2 M^2 \log n \log M)$ and
$O(n^2 N M \log n \log M)$, respectively. Hence, given a set $D$ of tree-structured
documents, a minimum support $0 < \sigma \le 1$ and a set of Stop Words, Find_Freq_CPP
can find all $\sigma$-frequent CPPs in $O(\max(n^2 M^2 \log n \log M, k n^2 M N \log n \log N))$






T. Uchida, T. Mogawa, and Y. Nakamura 



Procedure Expand_Pattern

Input: A set F[k] of pairs of CPPs whose lengths are k and lists of nodes, a set
       Freq_Word of pairs consisting of words and lists of nodes, and a minimum
       support 0 < σ ≤ 1.

Output: The set F[k + 1] = {(α, JS_D(α)) | α is a σ-frequent CPP w.r.t. D
        whose length is k + 1}.

1. for each (α, JS_D(α)) ∈ F[k] do
   /* let α = (t_1; t_2; ...; t_k) be a CPP over (W_1, W_2, ..., W_{k+1}) */
2.   for each (W, list_W) ∈ Freq_Word such that W_{k+1} < W do
3.     tmp := ∅;
4.     for each node x ∈ JS_D(α) and each node y ∈ list_W do
5.       search a 2-tree t over the pair (W_{k+1}, W) from x to y,
         if both x and y are in the same tree;
6.       execute the following instruction (1) or (2);
         (1) add y to list if there exists a pair (t, list) in tmp with y ∉ list;
         (2) add (t, y) to tmp if t ∉ {s | (s, L) ∈ tmp};
7.     end do
8.     if (#Occ_D((t_1; t_2; ...; t_k; t)))/(#D) ≥ σ then
         F[k + 1] := F[k + 1] ∪ {((t_1; t_2; ...; t_k; t), list) | (t, list) ∈ tmp};
9.   end do; /* end of inner for loop */
10. end do; /* end of outer for loop */
11. return F[k + 1];

Fig. 5. The procedure Expand_Pattern





[Plots of running time (sec): (a) Find_Freq_CPP, (b) Prev_Rep_Algorithm]

Fig. 6. The running times of Find_Freq_CPP and Prev_Rep_Algorithm



time, where $k$ is the maximum length of the CPPs found. This shows that if $M$
is polynomial in the input size, Find_Freq_CPP terminates in polynomial
time. For lower minimum supports, $M$ and $N$ become large. So,
we need to give an appropriate set of Stop Words as input when we want to
obtain all frequent CPPs.

[Plots of running time: (c) Find_Freq_Rep_CPP, (d) Prev_Rep_Algorithm]

Fig. 7. The running times of Find_Freq_Rep_CPP and Prev_Rep_Algorithm

By slightly modifying the above algorithm Find_Freq_CPP, we construct an
algorithm for solving the Frequent Representative CPP Problem. This algorithm is
denoted by Find_Freq_Rep_CPP.

4 Experimental Results 

In this section, we show the efficiency of our new algorithms Find_Freq_CPP and
Find_Freq_Rep_CPP given in the previous section. We implemented Find_Freq_CPP
and Find_Freq_Rep_CPP, and our previous algorithms Prev_Algorithm and
Prev_Rep_Algorithm in [6], which solve the Frequent CPP Problem and the Frequent
Representative CPP Problem, respectively. All algorithms are implemented in
C++ on a PC running Red Hat Linux 8.0 with two 3.2 GHz Pentium 4 processors
and 1GB of main memory. In the following experiments, as Stop Words, we
chose symbols such as “-, +, ...”, numbers such as “0, 1, 2, ...”, pronouns such as
“it, this, ...”, articles “a, an, the”, auxiliary verbs “can, may, ...” and so
on.

We applied these algorithms to the Reuters-21578 text categorization collection
[9], which has 21,578 SGML documents and whose size is about 28.0MB, for
each minimum support in {0.06, 0.08, 0.10} and each number of documents
in {5,000, 10,000, 15,000, 20,000, 21,578}.

Fig. 6 (a), (b) and Fig. 7 (c), (d) show the running times of Find_Freq_CPP,
Prev_Algorithm, Find_Freq_Rep_CPP and Prev_Rep_Algorithm in the experiments,
respectively. Each running time is the time needed from reading all input
documents up to finding all frequent CPPs or all frequent representative
CPPs. Tree structures such as SGML/XML files can be freely defined by using
strings given by users as tags. So, when analyzing miscellaneous tree-structured
documents, it may not be necessary to focus on tag names. In that case,
Find_Freq_Rep_CPP is suitable for extracting structural features among words.
From Figs. 6 and 7, Find_Freq_CPP and Find_Freq_Rep_CPP are more efficient
than Prev_Algorithm and Prev_Rep_Algorithm given in [6], respectively.
Moreover, for lower minimum supports, we see that both Find_Freq_CPP and
Find_Freq_Rep_CPP have good performance for analyzing a large set of
tree-structured documents such as Reuters news-wires.

5 Concluding Remarks 

In this paper, we have considered the problem of extracting structural features 
among words from a set of tree-structured documents. We have defined a con- 
secutive path pattern on a list of words and have shown efficient data mining 
algorithms for solving the problem. Moreover, in order to show the efficiency 
of our algorithms, we have reported some experimental results on applying our 
algorithms to Reuters news-wires. This work was partially supported by
Grant-in-Aid for Young Scientists (B) No. 14780303 from the Ministry of Education,
Culture, Sports, Science and Technology and Grant for Special Academic
Research No. 2117 from Hiroshima City University.


Abstract. This paper investigates a number of techniques for calibration
of the output of a Support Vector Machine in order to provide a
posterior probability P(target class | instance).

Five basic calibration techniques are combined with five ways of correct- 
ing the SVM scores on the training set. The calibration techniques used 
are addition of a simple ramp function, allocation of a Gaussian den- 
sity, fitting of a sigmoid to the output and two binning techniques. The 
correction techniques include three methods that are based on recent the- 
oretical advances in leave-one-out estimators and two that are variants of 
hold-out validation set. This leads us to thirty different settings (includ- 
ing calibration on uncorrected scores). All thirty methods are evaluated 
for two linear SVMs (one with linear and one with quadratic penalty) 
and for the ridge regression model (regularisation network) on three cat- 
egories of the Reuters Newswires benchmark and the WebKB dataset. 
The performance of these methods is compared with both the probabilities
generated by a naive Bayes classifier and those of a calibrated centroid
classifier.

The main conclusions of this research are: (i) simple calibrators such as
ramp and sigmoids perform remarkably well; (ii) score correctors using
leave-one-out techniques can perform better than those using validation
sets; however, cross-validation methods allow more reliable estimation of
test error from the training data.



1 Introduction 

When applying data mining techniques to a particular predictive modelling task,
one typically trains a model to produce a score for each instance such that the
instances may be ordered based on their likelihood of target class membership.
However, in many applications of interest, it is not sufficient to be able to induce
an ordering of cases. For instance, in a marketing campaign, if one needs to action,
say, 20% of the target class cases, an ordering based on likelihood of membership
is not enough. Instead, what is required is an accurate estimate of the posterior
probability P(target class | instance).

The problem of such estimation, particularly for decision tree classifiers, has
been an area of active research for some time. It has been noted that decision
trees trained to maximise classification accuracy generally behave poorly as
conditional probability estimators. This has prompted methods for improving
decision tree estimates. In the paper by Zadrozny and Elkan [18], a number of
such methods, including various smoothing techniques, a method known as
curtailment and an alternative decision tree splitting criterion, are compared. In
addition, that paper also describes a histogram method known as binning and applies
this method to naive Bayes classifiers with some success. This success prompted
the application of the binning method to calibrate the output of Support Vector
Machines (SVM) [4]. Other calibration techniques converting classifier scores to
conditional probability estimates include the use of the training set to fit a
sigmoid function [14] and other standard statistical density estimation techniques
[6,16]. All of these techniques may be applied to any ranking classifier, i.e., a
classifier that returns a score indicating likelihood of class membership.

It is known that SVMs tend to overtrain, providing a separation of support 
vectors which is too optimistic. Hence, for reliable empirical estimates based 
on the training data, it is advantageous to use corrected scores for support 
vectors prior to calibration. We investigate correction techniques based on recent 
theoretical advances in the leave-one-out estimator [7,8,13] and compare them 
to techniques based on hold-out validation sets. 

In this paper, we evaluate the applicability of different calibration (Sec- 
tion 3.2) and score correction (Section 3.1) techniques for allocating posterior 
probabilities in the text categorisation domain. We also compare these probabil- 
ities against those estimated by naive Bayes classifiers (Section 4.2) and evaluate 
the reliability of test error estimates from training data (Section 4.3). 

2 Metrics 

In this section we briefly introduce the basic concepts and methodology for 
quantifying the performance of learning machines which are suitable for the 
purpose of this paper. We focus on the simplest case of discrimination of two 
classes. 

We assume that there is given a space of labelled observations $(x, y) \in X \times \{\pm 1\}$
with a probability distribution $P(x, y)$. Our ultimate goal is to estimate
the posterior probability $P[y|x]$. This will be done via estimates of the posterior
$P[y|f]$, where $f : X \to \mathbb{R}$ is a model able to score each observation $x \in X$. These
estimates will be of the form

$$\hat{P}[y|x] = \begin{cases} \psi(f(x)) & \text{for } y = 1, \\ 1 - \psi(f(x)) & \text{for } y = -1, \end{cases}$$

where $\psi : \mathbb{R} \to [0,1] \subset \mathbb{R}$ is a particular calibration method, i.e. a method of
allocating a class probability to scores.

The main metric chosen in this paper for quantification of quality of these
estimators is the standard error, which is the square root of the mean squared
error (MSE). Other metrics such as average log loss have been investigated.
In [18] MSE and average log loss were found to provide similar evaluations of
candidate probability estimators. We have used the standard error because the
log loss metric can become infinite under certain conditions.

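The squared error for a single labelled example $x \in X$ is defined as

$$\sum_{y = \pm 1} \left( P[y|x] - \hat{P}[y|x] \right)^2.$$

In practice, for a test set of $m$ examples

$$\vec{xy}^m := ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{\pm 1\})^m, \qquad (1)$$

where true labels $y_i$ are known but not true probabilities $P[y_i|x_i]$, we shall
substitute 1 for $P[y_i|x_i]$ and 0 for $P[-y_i|x_i]$ [1,18]. This leads us to the following
empirical formula for the standard error:

$$SE(\vec{xy}^m) := \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left[ \left(1 - \hat{P}[y_i|x_i]\right)^2 + \hat{P}[-y_i|x_i]^2 \right]}. \qquad (2)$$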
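The empirical standard error described above can be sketched in a few lines of Python. The sketch rests on two assumptions that the text does not spell out in this form: the squared error of one example sums over both classes, and the mean is taken over the m examples. The estimate for the opposite label is taken to be one minus the estimate for the true label.

```python
import math


def standard_error(prob_true_label):
    """Empirical standard error from the estimated probabilities of the true labels.

    prob_true_label[i] is the estimate of P[y_i | x_i]; the estimate for the
    opposite label is assumed to be 1 - prob_true_label[i].
    """
    m = len(prob_true_label)
    total = 0.0
    for p in prob_true_label:
        # Substitute 1 for P[y_i | x_i] and 0 for P[-y_i | x_i].
        total += (1.0 - p) ** 2 + (0.0 - (1.0 - p)) ** 2
    return math.sqrt(total / m)
```

Unlike the average log loss, this quantity stays finite even when an estimate of exactly 0 is assigned to the true label.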



3 Classifiers and Estimators 

In this research, we have used five classifiers: two linear support vector machines
(SVM), one Ridge Regression model, a simple centroid classifier and a naive
Bayes classifier. The naive Bayes classifier was implemented as described in [11], and
class probability was defined accordingly by the generated posteriors for the two
categories of interest. The other four classifiers are implemented as described below.



Support Vector Machine (SVM). Given a training m-sample (1), a learning
algorithm used by SVM [2,3,17] outputs a model $f_{\vec{xy}^m} : X \to \mathbb{R}$ defined as the
minimiser of the regularised risk functional:

$$f \mapsto \|f\|_{\mathcal{H}}^2 + \sum_{i=1}^{m} L([1 - y_i f(x_i)]_+). \qquad (3)$$

Here $\mathcal{H}$ denotes a reproducing kernel Hilbert space (RKHS) [17] of real valued
functions $f : X \to \mathbb{R}$, $\|\cdot\|_{\mathcal{H}}$ the corresponding norm, $L : \mathbb{R} \to \mathbb{R}_+$ is a
non-negative, convex cost function penalising for the deviation $1 - y_i f(x_i)$ of the
estimator $f(x_i)$ from target $y_i$ and $[\xi]_+ := \max(0, \xi)$. In this paper we shall
consider exclusively the following two types of the loss function: (i) linear (SVM$^1$):
$L(\xi) := C\xi$, and (ii) quadratic (SVM$^2$): $L(\xi) := C\xi^2$, where $C > 0$ is a
regularisation constant.
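For a concrete reading of the regularised risk, the following Python sketch evaluates it for a linear model f(x) = <w, x>, for which the RKHS norm of f equals the Euclidean norm of w. It only evaluates the functional for a given w; it is not a solver, and the function names and the toy data are assumptions made for the example.

```python
def hinge(z):
    """[z]_+ := max(0, z)."""
    return max(0.0, z)


def regularised_risk(w, samples, C, loss="linear"):
    """||w||^2 + sum_i L([1 - y_i <w, x_i>]_+) for a linear model f(x) = <w, x>.

    loss="linear" uses L(xi) = C * xi; any other value uses L(xi) = C * xi**2.
    """
    risk = sum(wj * wj for wj in w)
    for x, y in samples:
        score = sum(wj * xj for wj, xj in zip(w, x))
        xi = hinge(1.0 - y * score)
        risk += C * xi if loss == "linear" else C * xi * xi
    return risk


# Toy sample: the first point is beyond the margin, the second is inside it,
# and the third is misclassified by w = [1, 0].
samples = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([1.0, 0.0], -1)]
```

The quadratic loss penalises the misclassified point more heavily than the linear loss, which is one way the two machines can end up with different solutions.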

The minimisation of (3) can be solved by quadratic programming [2] with
the formal use of the following expansion known to hold for the minimiser (3):

$$f_{\vec{xy}^m}(x) = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x), \qquad (4)$$

$$\|f_{\vec{xy}^m}\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \qquad (5)$$

where $k : X \times X \to \mathbb{R}$ is the kernel corresponding to the RKHS $\mathcal{H}$ [9,15]. The
coefficients $\alpha_i$ are unique and they are the Lagrange multipliers of the quadratic
minimisation problem corresponding to the constraints $y_i f_{\vec{xy}^m}(x_i) > 0$.

If $\alpha_i \neq 0$, then $x_i$ is called a support vector. It can be shown that $y_i f_{\vec{xy}^m}(x_i) \le 1$
for each support vector. For experiments in this paper, it is convenient to
slightly expand the concept of a support vector to the subset

$$SV := \{i \,;\, y_i f_{\vec{xy}^m}(x_i) \le 1 + \epsilon \text{ for } i = 1, \ldots, m\}$$

where $\epsilon > 0$ is a constant. This definition attempts to accommodate the fact
that typically practical solutions are sub-optimal due to early stopping.

Ridge Regression (RN$^2$). In addition to the SVMs we also use a Regularisation
Network or ridge regression predictor, RN$^2$ [3,5,9,15]. Formally, this
predictor is closely related to SVM$^2$, the only difference being that it minimises a
modified risk (3), with loss $L = C(1 - y_i f(x_i))^2$ rather than $L = C[1 - y_i f(x_i)]_+^2$.
As we shall see, both machines may display quite different performance.

Centroid Classifier (Cntr). The centroid classifier is a simple linear classifier
with the solution

$$f_{Cntr}(x) := \frac{\sum_{i,\, y_i = +1} k(x_i, x)}{2 \max(1, m_+)} - \frac{\sum_{i,\, y_i = -1} k(x_i, x)}{2 \max(1, m_-)},$$

where $m_+$ and $m_-$ denote the numbers of examples with labels $y_i = +1$ and $y_i = -1$,
respectively. In terms of the feature space, the centroid classifier implements
the projection in the direction of the weighted difference between the centroids
of data from each class. It can be shown that the centroid solution approximates
the solution obtained by SVMs at very low values of C [10].
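A minimal Python sketch of the centroid score follows, assuming the solution is the kernel sum over positive examples divided by 2 max(1, m+) minus the kernel sum over negative examples divided by 2 max(1, m-), which is how we read the definition above. The linear kernel used by default and all function names are choices made for the example.

```python
def dot(u, v):
    """Linear kernel k(u, v) = <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def centroid_score(samples, x, kernel=dot):
    """Score of x: scaled kernel sum over positives minus that over negatives."""
    pos = [xi for xi, yi in samples if yi == +1]
    neg = [xi for xi, yi in samples if yi == -1]
    # max(1, .) guards against division by zero when a class is empty.
    pos_term = sum(kernel(xi, x) for xi in pos) / (2 * max(1, len(pos)))
    neg_term = sum(kernel(xi, x) for xi in neg) / (2 * max(1, len(neg)))
    return pos_term - neg_term


samples = [([1.0, 0.0], +1), ([3.0, 0.0], +1), ([-2.0, 0.0], -1)]
```

No optimisation is involved: training reduces to summing kernel values, which is why the classifier is cheap compared with the SVMs.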

3.1 Score Correctors 

It is known that support vector machines tend to overtrain, providing a separation
of support vectors which is too optimistic. Hence, for reliable empirical estimates
based on the training data, it is advantageous to use corrected scores for support
vectors prior to calibration. We investigate three correction methods based on
the leave-one-out (LOO) estimator for SVMs. The first two corrections are based on
a theoretical bound, and are applied to “spread” the scores of support vectors
and other training instances that have scores close to ±1. The Jaakkola-Haussler
correction ($f^{JH}_{\vec{xy}^m}(x)$) is the lower bound of the LOO estimator derived in [7,15]
and is applied to the support vector set $SV$ only. The Opper-Winther correction
($f^{OW}_{\vec{xy}^m}(x)$) is a simplified approximation of the LOO estimator derived in [13,15]
and is applied to the union of the support vector and the margin support
vector sets ($SV \cup mSV$). We use the following “practical” definition for the set
of margin support vectors:

$$mSV := \{i \,;\, |y_i f_{\vec{xy}^m}(x_i) - 1| \le \epsilon \text{ for } i = 1, \ldots, m\}$$

where $\epsilon > 0$ is a constant. This set is a “practical” substitute for the theoretical
set of the margin support vectors studied in [13], taking into account that a
particular practical solution is only an approximation of the true minimiser of
the empirical risk.
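These two LOO-based corrections are defined as follows:

$$f^{JH}_{\vec{xy}^m}(x_i) := f_{\vec{xy}^m}(x_i) - \alpha_i y_i k(x_i, x_i)\, I_{i \in SV}, \qquad (6)$$

$$f^{OW}_{\vec{xy}^m}(x_i) := \begin{cases} f_{\vec{xy}^m}(x_i) - \alpha_i y_i k_{ii} & \text{if } i \in SV - mSV, \\ f_{\vec{xy}^m}(x_i) - \alpha_i y_i \left[ K_{mSV}^{-1} \right]_{ii}^{-1} I_{i \in mSV} & \text{otherwise.} \end{cases} \qquad (7)$$

Here $i = 1, \ldots, m$ and $I_{i \in SV}$ is the indicator function equal to 1 if $i \in SV$
and 0 otherwise; $I_{i \in mSV}$ is defined analogously. Here $k_{ij} := k(x_i, x_j)$,
$K_{mSV} := (k_{ij})_{i,j \in mSV}$ is a square matrix of dimension $\#mSV$ and $[K]_{ii}$ denotes the
element of the main diagonal of the matrix $K$ corresponding to index $i$.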

The third method introduced in this paper is a heuristic modification to
alleviate the fact that the lower bound of the Jaakkola and Haussler correction is
known to be pessimistic. It is defined as follows:

$$f^{unif}_{\vec{xy}^m}(x_i) := f_{\vec{xy}^m}(x_i) - r\, \alpha_i y_i k(x_i, x_i)\, I_{i \in SV}, \quad \text{for } i = 1, \ldots, m.$$

Here, $r$ is a number drawn from a uniform distribution on the segment [0,1],
hence its working name uniform corrector. The intention here is to “spread” the
scores for support vectors away from the margin, but not necessarily as much as
in the $f^{JH}_{\vec{xy}^m}$ case.
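The Jaakkola-Haussler style correction and the uniform corrector can be sketched in Python as below. The sketch makes several assumptions that should be checked against the original formulas: both corrections subtract a multiple of alpha_i * y_i * k(x_i, x_i) from the score of each instance in the expanded support vector set, the set is taken as the indices with y_i f(x_i) <= 1 + eps, and the random factor r is drawn independently for every instance. All function and parameter names are hypothetical.

```python
import random


def support_vector_set(scores, labels, eps=1e-3):
    """Indices i with y_i * f(x_i) <= 1 + eps (expanded support vector set)."""
    return {i for i, (f, y) in enumerate(zip(scores, labels)) if y * f <= 1.0 + eps}


def jh_correction(scores, labels, alphas, k_diag, eps=1e-3):
    """Subtract alpha_i * y_i * k(x_i, x_i) from the score of each i in SV."""
    sv = support_vector_set(scores, labels, eps)
    return [f - a * y * k if i in sv else f
            for i, (f, y, a, k) in enumerate(zip(scores, labels, alphas, k_diag))]


def uniform_correction(scores, labels, alphas, k_diag, eps=1e-3, rng=random):
    """As jh_correction, but scaled by a factor r drawn uniformly from [0, 1].

    Assumption: r is drawn independently for every instance.
    """
    sv = support_vector_set(scores, labels, eps)
    return [f - rng.random() * a * y * k if i in sv else f
            for i, (f, y, a, k) in enumerate(zip(scores, labels, alphas, k_diag))]
```

Passing a seeded generator, for example `rng=random.Random(0)`, makes the uniform corrector reproducible; each uniformly corrected score then lies between the uncorrected score and its Jaakkola-Haussler style counterpart.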

The fourth and fifth methods of score correction evaluated in this paper are
based on the use of a hold-out set. For the first hold-out method (Hold$^1$) the training
data is split into two parts: one is used for training the classifier, and the second
is used to derive the parameters of the calibration method. In other words, we
apply the classifier to the hold-out set, and use this to adjust the calibration
method optimally to the trained classifier. The second hold-out method (Hold$^2$)
is the same as Hold$^1$ except that after adjustment of the calibration method the
whole training set is used to train a new classifier, which is used in conjunction
with the previously derived calibration.

We report here the results for a single split (50% for training and 50% held out)
for both Hold$^1$ and Hold$^2$. Note that this is different from the split used in [4] and was
chosen to accommodate the larger variations in class sizes that are common in text
categorisation.




3.2 Calibration Methods 



We have tested five techniques for allocation of class probability to a classifier’s 
score. The first of them is independent of the training data, while the remaining 
four use the training set. 

Ramp Calibrator. The first method is a simple ramp calibrator:

$$\hat{P}[y = 1 \mid t] = \begin{cases} (1 + t)/2 & \text{for } |t| < 1, \\ (1 + \mathrm{sgn}(t))/2 & \text{otherwise} \end{cases} \qquad (8)$$

for $t \in \mathbb{R}$. This calibrator is independent of the training data.
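As a small illustration, the ramp calibrator can be written as a single function. This is a sketch assuming the ramp has the form $(1 + t)/2$ inside the margin and $(1 + \mathrm{sgn}(t))/2$ outside it; the function name is hypothetical.

```python
def ramp_calibrate(t):
    """Map a raw score t to a probability estimate with a ramp on (-1, 1)."""
    if abs(t) < 1:
        return (1 + t) / 2
    # Outside the margin the estimate saturates at 0 or 1 according to sign(t).
    sign = 1 if t > 0 else -1
    return (1 + sign) / 2
```

Because it has no parameters, this calibrator needs no training data at all.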

Gauss Calibrator. The second method implements a routine statistical
technique for density estimation [6,16]:

$$\hat{P}[y = 1 \mid t] = \frac{\sum_{i :\, y_i = 1} K_h(f(x_i) - t)}{\sum_{i=1}^{m} K_h(f(x_i) - t)} \qquad (9)$$

where $K_h(t) := \exp(-0.5\, t^2 / h^2)$ is the popular Gauss kernel. We have used
$h = 0.5$ in our experiments and applied (9) to scores clipped to the range $[-3, 3]$.

Sigmoidal Calibrator. The third method fits a sigmoid to the SVM output:

$$\hat{P}[y = 1 \mid t] = \frac{1}{1 + \exp(At + B)} \qquad (10)$$

where constants $A$, $B$ are determined from the training data using Platt’s method
[14]. This finds parameters $A$, $B$ by minimising the negative log likelihood of the
training data, which is a cross-entropy error function:

$$\min_{A,B} \; - \sum_{i} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right], \qquad p_i := \frac{1}{1 + \exp(A f(x_i) + B)},$$

where $t_i = (n_+ + 1)/(n_+ + 2)$ if $y_i = 1$ and $t_i = 1/(n_- + 2)$ otherwise, where $n_+$
and $n_-$ are the numbers of positive and negative examples in the training set.
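The sigmoid fit can be illustrated with a short sketch. It assumes the model $p = 1/(1 + \exp(A f + B))$ with Platt's regularised targets and minimises the cross-entropy by plain gradient descent. The optimiser is a simplification chosen for brevity and is not the procedure of [14], and the function names, learning rate and iteration count are hypothetical. Very large scores could overflow `math.exp`, so scores are assumed to be of moderate size.

```python
import math


def platt_targets(labels):
    """Regularised targets t_i used in place of hard 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1) / (n_pos + 2)
    t_neg = 1 / (n_neg + 2)
    return [t_pos if y == 1 else t_neg for y in labels]


def sigmoid(score, a, b):
    """Calibrated probability 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def fit_sigmoid(scores, labels, lr=0.1, n_iter=5000):
    """Fit (A, B) by gradient descent on the cross-entropy (a simplification)."""
    targets = platt_targets(labels)
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            p = sigmoid(f, a, b)
            # d(NLL)/dA = sum (t_i - p_i) * f_i and d(NLL)/dB = sum (t_i - p_i)
            grad_a += (t - p) * f
            grad_b += t - p
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

A negative fitted $A$ means that larger scores map to larger probabilities of the target class.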

Binning Methods. The fourth and fifth methods for calibration are variants
of the histogram method [4,18]. This method proceeds by first ranking the
training examples according to their scores, then dividing them into $b$ subsets,
called bins. Each bin is defined by its lower and upper boundary. A test
example $x$ falls into a bin if and only if its score $f(x)$ is within the bin
boundaries. The estimate $\hat{P}[y \mid x]$ of the probability that $x$ belongs to the class $y$ is set
to the fraction of the training examples of class $y$ which are in the same bin.

In our experiment we have used a fixed number of bins $b = 10$ as in [4,18].
We have used two different methods for definition of bin boundaries. The first
binning method, referred to as Bin1, is exactly as in the above references: it
selects bin boundaries such that each bin contains the same number of training
examples (irrespective of example class). A potential problem arises with this
method when the data is heavily biased and well separable by a classifier. This
is exactly the case often encountered in text mining using SVMs, where the
target (minority) class accounts for less than 10% of the data and almost all
target examples fall into the top 10% segment of scores, i.e. a single bin. To
alleviate this problem, we have used another binning method, referred to as Bin2.
In this method we first set a default bin as the open segment $(-\infty, a)$, where
$a = \min\{f(x_i) \;;\; y_i = 1\}$ (we always assume that the target class corresponds
to the label $y = +1$). Hence this bin contains no target class examples. Then
we set boundaries of the remaining top $b - 1$ bins in such a way that the target
class cases in the training set are evenly split into $b - 1$ equal subsets. In this
case we typically encounter some larger and smaller bins, with the largest, the
default bin, containing no target class cases. As the experimental results show,
this method outperforms Bin1 in cases where there are few target cases in the
training set (cf. Figure 1).
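The two ways of placing bin boundaries can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, ties between scores are not treated specially, and an empty bin is assigned probability 0.0, all of which are assumptions rather than details given in the text.

```python
import bisect


def equal_count_edges(values, n_bins):
    """Interior boundaries splitting the sorted values into n_bins near-equal groups."""
    ordered = sorted(values)
    return [ordered[(j * len(ordered)) // n_bins] for j in range(1, n_bins)]


def bin1_edges(scores, b=10):
    """First method: every bin holds the same number of training examples."""
    return equal_count_edges(scores, b)


def bin2_edges(scores, labels, b=10):
    """Second method: a default bin below the lowest target-class score, then
    b - 1 bins holding equal numbers of target-class examples."""
    target_scores = [f for f, y in zip(scores, labels) if y == 1]
    a = min(target_scores)
    return [a] + equal_count_edges(target_scores, b - 1)


def bin_index(score, edges):
    """Index of the bin containing score; bin j covers [edges[j-1], edges[j])."""
    return bisect.bisect_right(edges, score)


def fit_bin_probabilities(scores, labels, edges):
    """Fraction of target-class training examples in each bin (0.0 if empty)."""
    n_all = [0] * (len(edges) + 1)
    n_pos = [0] * (len(edges) + 1)
    for f, y in zip(scores, labels):
        j = bin_index(f, edges)
        n_all[j] += 1
        n_pos[j] += 1 if y == 1 else 0
    return [p / n if n else 0.0 for p, n in zip(n_pos, n_all)]
```

A test score is then calibrated by looking up `fit_bin_probabilities(...)[bin_index(score, edges)]`. With a rare target class, the second method spends all but one bin on the score range where target examples actually occur.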

4 Experiments and Results

In our evaluation, we use two distinct corpora of documents and different cate- 
gories from each of the corpora. The experimental results presented for each of 
the settings are averages over 20 models with a one-standard deviation bar to 
show the variance in the results. Each of the 20 models is created using a fraction 
(p) of the total number of examples in the corpus, and testing on the remaining, 
where these training examples are selected randomly and the sampling is not 
proportional, i.e., non-stratified random sampling is performed. 

4.1 Data Collections 

Our first corpus is the Reuters-21578 news-wires benchmark. Here we used a
collection of 12902 documents (combined test and training sets of the so-called ModApte
split available from http://www.research.att.com/lewis) which are categorised
into 115 overlapping categories. Each document in the collection has been
converted to a 20,197-dimensional word-presence vector, using a standard stop-list,
and after stemming all of the words using a standard Porter stemmer.

Our second corpus is the WebKB dataset which consists of web pages gathered
from university computer science departments [12]. After removing all those
pages that specify browser relocation only, this set consists of 7,563 web pages
split across 7 categories. In this paper, we use the three most populous categories
(apart from ‘other’): Course, Student and Faculty. We parse the HTML tags, and use all
the words without stemming, which results in a feature set consisting of 87,601
words. We do not use a stoplist since some of the typical stopwords such as “my”
are useful for prediction in this case.

4.2 Standard Error Results 

Figure 1 shows the standard error for three Reuters categories: category “earn”,
the largest category, which is roughly 30% of the data (first row), category “crude”,
which is just 4% of the data (second row), and category “corn”, which forms 2% of
the data (third row). For this set of results, we have used a split of 10% training
and 90% test (p = 0.1).

Results are presented for calibration of raw scores (first column) as well
as the three LOO-based score correctors (columns 2-4) and the two correctors
based on hold-out sets (columns 5-6). Within each column, calibration results are
shown for all of the five calibrators described in Section 3.2, although the ramp
calibrator is unaffected by correction of training scores and hence has a different
result only for Hold1, which uses a smaller training set for model generation.
Standard error is presented for all 30 combinations for three machines: SVM
with quadratic penalty ($SVM^2$, first marker: cross), SVM with linear penalty
($SVM^1$, second marker: diamond) and ridge regression ($RN^2$, third marker:
square). For all machines, the regularisation constant in the SVM equation in
Section 3 is set to $C = 800/m$, where $m$ is the number of instances in the training
set. In addition, we present the error for the centroid classifier (fourth marker:
triangle) for the raw scores and for score correctors based on hold-out sets. The
last column (column 7) shows the probability estimates obtained from the naive
Bayes classifier (shown by inverted triangle). Points missing from the graph are
those with standard error larger than 0.42.

A. Kowalczyk, B. Raskutti, and H. Ferra

[Figure 1: Standard Error for three Reuters Categories; Train/Test Split = 10:90]

Fig. 1. Standard error for three Reuters categories for different score correctors and
calibrators for different machines. We denote the various calibrators as follows: ramp
(Ra), Gauss (Ga), sigmoidal (Si), Bin1 (B1) and Bin2 (B2). Score correctors are
denoted as follows: $f^{JH}(x)$ (+scJH), $f^{OW}(x)$ (+scOW), $f^{Uni}(x)$ (+scUni), Hold1
(+Hold1) and Hold2 (+Hold2). Raw refers to calibrated, uncorrected scores. The
seventh column shows the probability estimates obtained from the naive Bayes classifier.

Comparing different calibration techniques, we note the following: 

1. The ramp calibrator performs extremely well across all score correction
methods except $f^{JH}(x)$, and across all classifiers except centroid. In general
the ramp method applied to the raw scores of either $SVM^1$ or $SVM^2$ produces
estimates very close (or equal) to the minimum standard error over all the other
methods. This is quite impressive given the simplicity of the method. A further
point to note is that the method is very consistent over different training set
samples, showing very little variance (the notable exception being column 2, i.e.,
after correction using $f^{JH}(x)$).

2. The ramp calibrator performs very poorly with the centroid classifier.
However, other calibrators such as the sigmoid and Gauss calibrators, as well as the
binning ones, return good estimates for the centroid classifier. Thus, more sophisticated
calibration methods are useful when the classifier is simple and does not separate
with a margin.

3. Regardless of the correction method used, the gauss calibrator estimates 
are much worse than any other calibrator and also have much higher variance. 

4. Binning methods, particularly Bin1, do not have any advantage over any
calibrators other than Gauss. As expected, Bin2 performs better than Bin1 for
the two smaller categories (rows 2 and 3), and this is true with score correction
using hold-out methods, as well as raw scores.

Comparing the performance of different score correction techniques, we note
the following:

1. Adjustment of the model scores by use of any of the correction techniques
prior to calibration has a very limited positive impact, and this happened only
with $RN^2$, which usually performs worse than the two SVMs on the raw scores.
One explanation for this behaviour could be the fact that $RN^2$ concentrates
training case scores at $-1$ and $+1$, which leads to a significantly larger number
of training set points which are subsequently corrected.

2. The performance of all of the LOO-based correction methods is very similar.
However, there are significant computational advantages for the $f^{JH}(x)$ and
$f^{Uni}(x)$ methods since these, unlike $f^{OW}(x)$, do not involve a matrix inversion.

3. Although our $f^{OW}(x)$ correction is theoretically justified for the linear
penalty case only, it is worth noting that the performance in the $SVM^2$ case is
at least comparable to that of $SVM^1$. This is rather unexpected and requires
further investigation. Specifically, what is needed is to properly derive the
Opper-Winther LOO correction in the quadratic case to compare the difference.

4. Not surprisingly, correction of scores prior to calibration helps to reduce 
the variance of estimates. However, there is no significant advantage in using 
cross-validation sets for correction. This suggests that good probability estimates 
can be obtained if the model used for categorisation is trained on all of the 
available data. 

Finally, the probability estimate obtained from the naive Bayes classifier
is reasonably accurate, although always worse than that obtained from ramp
calibration on raw scores with the SVMs. It would be interesting to explore the
effect of binning on these probabilities [4].



Generalisation to Other Datasets. Figure 2 shows the standard error for
three WebKB categories: Course (12% of data, first row), Student (22% of data, second
row) and Faculty (14% of data, third row). The notation, the format and the
training/test split (10%/90%) are the same as those for Figure 1. These experiments
were performed in order to understand how the above observations hold with
other datasets.

Comparing Figures 1 and 2, we see that there are two significant differences 
regarding performance of different machines. Firstly, SVM^ performs signifi- 
cantly better than SVM'^ . However, the observation regarding ramp calibrator 
with raw scores still holds with SVM^. Secondly, all score correction methods 
have a positive impact on both SV and RN"^ and bring it closer to the perfor- 
mance of SVM^, however, the impact of score correction on the best performer, 
SVM^ is not consistent. This is in line with the observation that correctors 
improve the performance when the baseline performance is sub-optimal (e.g., 
improvement for RN^ for the Reuters dataset). 



Impact of Training Set Size. In order to determine the effect of the number
of training examples on the different correctors and calibrators, the experiments
described in the previous section with a training/test split of 10:90 were repeated
with a split of 50:50 (results not shown here due to space considerations). We
observed that the only impact of the larger training set size is the lower standard
error and the lower variance across all categories, calibrators, correctors and
classifiers. This is true for both the Reuters and the WebKB datasets.

[Figure 2: Standard Error for three WebKB Categories; Train/Test Split = 10:90]

Fig. 2. Standard error for three WebKB categories for different score correctors and
calibrators. The settings and notations are the same as those for Figure 1.

4.3 Training/Test Error Ratio 

In order to select the best performing methods we need to have an estimate of
the performance of the method on unseen data. One way to achieve this is to
take part of the labelled data and hold it in reserve as an independent test set.
Alternatively, the performance on the training set can be used as an estimate
of the method’s performance on unseen data. As a guide to the accuracy of
such an estimate, we use the TrTeError Ratio, the training/test error ratio, which
is defined as the ratio of standard error on the training set to that on the test
set. For each method we characterise the accuracy of the training set estimate
using this TrTeError Ratio. Methods that provide a TrTeError Ratio close to 1
and good estimates on the training set are preferable to methods that provide
good estimates on the training set, but have a poor TrTeError Ratio (i.e., << 1).
If we are confident that a particular method has a TrTeError Ratio close to 1,
then we can use all of the labelled data for model generation, which as noted
above leads to more accurate models.

Figure 3 shows the TrTeError Ratio for the three Reuters categories for the
same settings and in the same format as in Figure 1. As can be seen from
Figure 3, the estimates of performance derived from the training set are consistently
optimistic for the SVMs (TrTeError Ratio << 1). This is common to all the
calibration methods, with the binning techniques, especially Bin}, appearing a 
little more immune to this than the other methods. This optimism is consistent 
with the observation that the SVM raw scores are known to be overtrained.

[Figure 3: TrTe Error Ratio for three Reuters Categories; Train/Test Split = 10:90]

Fig. 3. TrTeError Ratio for three Reuters categories for different score correctors and
calibrators. The settings and notations are the same as those for Figure 1.



SVM^ is the worst performer having TrTeErr or Ratio close to 0 in most cases, 
while RN^ , and naive Bayes have a TrTeErr or Ratio around 0.75. The centroid 
classifier provides very good estimates (~ 1), however, the standard error for 
this classifier is generally quite high (Figures 1 and 2). 

All correction methods tend to move the TrTeErr or Ratio closer to 1, with 
hold-out sets having the most impact. Hold} ^ in particular, provides consistently 
good estimates even with little available data, while the other correction methods 
tend to become more optimistic as training data becomes scarce. This is expected 
since overfitting is usually worse when there are few examples to learn from. 



5 Conclusion 

The paper has demonstrated the feasibility of extracting accurate probability 
estimates from SVM scores for text categorisation. Our evaluation with multiple 
text datasets shows that simple calibrators such as ramp calibrator can provide 
very accurate probabilities. While score correction does not improve the proba- 
bility estimates, it does indeed provide better estimation of the accuracy of the 
classifier on unseen data. The best compromise between accuracy of probability 
estimates and prediction of performance on unseen data is provided by the Bin} 
calibrator on raw scores. 

Future work indicated by these results includes investigation of other real
data sets and carefully designed artificial data. A significant effort should be
directed towards development of practical techniques (heuristics) for training
score correction. Results of this paper for our uniform score corrector show that
such heuristics exist. Finally, the probability estimates obtained here should be
tested against calibrated naive Bayes classifiers, and tested for their utility in
multi-class text categorisation.

Abstract. Concept drifting is always an interesting problem. For instance,
a user who is interested in a set of topics, X, for a period may switch
to a different set of topics, Y, in the next period. In this paper, we focus
on two issues of concept drifts, namely, concept drift detection and
model adaptation in a text stream context. We use statistical control
to detect concept drifts, and propose a new multi-classifier strategy for
model adaptation. We conducted extensive experiments and report our
findings in this paper.



1 Introduction 

For many learning tasks where data is collected over an extended period of time,
the underlying distribution may change. A changing context will induce changes
in the concept of the collection of data, producing what is known as concept
drifts.

Detecting concept drifts is important in many applications. For instance, a
news agent needs to decide the top stories and headlines every day or even every
hour based on the interests of its users. An information dissemination system
needs to decide how to broadcast news stories or promotions to the users who
will be most interested in them. An email filtering system should be able to filter out
spam email. All of these require tracing the pattern of the users’ interests.

In practice, it is possible to some extent to trace users’ interests with
historical data. For instance, we can trace the set of articles that the users are
interested in by monitoring the time that they spend browsing the articles.
Spam email can be removed by tracking which emails are
kept and which are deleted by the users. Yet, in different time horizons, users may be
interested in different things. For example, one may be interested in receiving
materials related to a presidential election during the election campaign period, but
this does not mean that the user is always interested in politics.

Broadly speaking, the problem of concept drifts can be divided into two
sub-problems: drift detection and model adaptation [3,4,5,6,7,10,11,12]. Detecting
changes is mostly discussed in information filtering [5,6,7]. On the other hand,
model adaptation is often done by maintaining a sliding window, of either fixed
or adaptive size, on the training data [4,12]. While these heuristics are intuitive
and work very well in many domains, most of them require complicated tuning
of their parameters, and the tunings are not transferable to other domains.

[3] describes a robust window approach for model adaptation using support
vector machines (SVM). However, their window adjustment algorithm requires
computing the $\xi\alpha$-estimator on every batch from scratch, which is too expensive
in practice. [10] illustrates how to maintain the support vectors for handling
concept drifts. However, they did not show any situations where concepts drift.
[11] proposes a framework for mining concept-drifting data streams by using
weighted ensemble classifiers. The authors show that using weighted ensemble
classifiers performs substantially better than just using a single classifier.

In this paper, we re-visit the problem of concept drifts in text stream contexts,
and study both concept drift detection and model adaptation. We focus on a
single classifier rather than a community of classifiers. For detecting concept drifts,
we make use of statistical quality control to detect any changes in a text stream.
For model adaptation, we propose an approach that uses different kinds of model
sketches (short-term and long-term). No complex parameter tuning is necessary.
We do not use a sliding window for model adaptation.



2 Concept Drifts 

A set of text documents is classified into multiple topics. Each topic is labeled
as either interesting ($C_+$) or non-interesting ($C_-$). A concept is the main theme
of the interesting topic(s). Concept drift means that the main theme of the
interesting event is changed. In the following, we follow the assumption that
text streams arrive as batches with variable length:

$$d_{1,1}, \ldots, d_{1,m_1};\; d_{2,1}, \ldots, d_{2,m_2};\; \ldots;\; d_{n,1}, \ldots, d_{n,m_n};\; \ldots$$

where $d_{i,j}$ denotes the $j$-th document in the $i$-th batch. The $i$-th batch consists of
$m_i$ documents ($m_i > 1$). We use $C_+^t$ and $C_-^t$ to denote the two classes $C_+$ and
$C_-$ at a given time $t$. A document at time $t$ is labeled as either interesting if it
belongs to $C_+^t$ or non-interesting if it belongs to $C_-^t$.

Figure 1 illustrates a simple concept drift scenario by using two sets of topics,
X and Y. There are 10 time intervals: $t_1, t_2, \ldots, t_{10}$. Documents arriving within
$[t_{i-1}, t_i]$ form a batch. From $t_0$ to $t_3$, the users are interested in X but not Y.
Beginning from $t_4$, the users decrease their interest in X but gain interest in Y.
Finally, from $t_7$, the users totally lose their interest in X and are interested in Y only.

Our problem is how to maintain such interesting and non-interesting concepts
in a text stream environment, with two objectives: 1) maximizing the accuracy of
classification; and 2) minimizing the cost of re-classification. Two related issues
are concept drift detection and model adaptation. Concept drift detection
is to figure out whether the concept in the stream has changed, whereas model
adaptation aims at capturing new concepts immediately.

Fig. 1. A simple example of concept drift.

Table 1. The contingency table for measuring the quality of a classifier.

                            Expert judgments
                            Yes      No
Classifier judgments   Yes  TP       FP
                       No   FN       TN



3 Concept Drifts Detection 



In this section, we introduce some measurements for classification quality, and
address our concerns. Then, we discuss our solutions.

When a text stream is being classified by a classifier and the true class labels
of these documents are given, the relationships between the classifier decisions
and their corresponding true class labels can be summarized as shown in Table
1. For measuring the effectiveness of a classifier, a common approach is to use
precision ($\pi$), recall ($\rho$) and $F_1$ [9]:

$$\pi = \frac{TP}{TP + FP}, \qquad \rho = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \pi \cdot \rho}{\pi + \rho} \qquad (1)$$

Yet, none of them is suitable for detecting concept drifts. The reason is that
$\pi$ ($\rho$) does not take FN and TN (FP and TN) into account, while $F_1$
does not take TN into consideration. Therefore, they cannot reveal any drift in
the non-interesting events. In this paper, we use error rate, $e$, for measuring the
effectiveness of a classifier, as it captures all four dimensions.



$$e = \frac{FP + FN}{TP + TN + FP + FN} \qquad (2)$$
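The difference between the four measures can be made concrete with a short sketch that computes them from the contingency table of Table 1. The function name and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and error rate from the four contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Only the error rate uses all four cells, including TN.
    error_rate = (fp + fn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "error": error_rate}
```

Changing only TN leaves precision, recall and $F_1$ untouched but moves the error rate, which is why the error rate is the one measure here that reacts to changes among the non-interesting documents.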



However, detecting concept drifts by simply comparing the error rates of two
consecutive batches is insufficient. Consider Figure 2, in which users gradually
lose their interest in topic X after time $t_n$. The bottom of Figure 2 shows a
typical result for the error rate of the classification. There always exist small
fluctuations in the error rate. Comparing the error rates directly may raise false
alarms.

Fig. 2. Concept drift and error rate.

In order to determine whether it is a fluctuation or a drift, we use statistical
quality control [8]. The basic idea is to test whether or not a single observation
($e$) is within a tolerance region:

$$\bar{e}_n - N \sigma_n < e < \bar{e}_n + N \sigma_n \qquad (3)$$

where $\bar{e}_n$ is the mean error rate of the $n$ previous observations, $\sigma_n$ is the
corresponding standard deviation and $N$ is the control tolerance. In this paper, we
set $N = 2$.
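The tolerance test can be sketched as a small function. This is an illustration under stated assumptions: the history of previous error rates is kept in a list, the population standard deviation is used (the text does not say which estimator is intended), and the function name is hypothetical.

```python
import statistics


def is_drift(previous_errors, current_error, tolerance=2.0):
    """Return True when current_error falls outside mean +/- tolerance * std."""
    mean = statistics.mean(previous_errors)
    std = statistics.pstdev(previous_errors)  # assumption: population std. dev.
    lower = mean - tolerance * std
    upper = mean + tolerance * std
    return not (lower < current_error < upper)
```

For example, with previous error rates `[0.10, 0.12, 0.11, 0.09, 0.10]` the tolerance region is roughly (0.084, 0.124), so an error rate of 0.11 is treated as a fluctuation while 0.20 is flagged as a possible drift.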

When a change is detected at $t$, we will adapt the classifier to include or
remove any concepts from $C_+^t$ and $C_-^t$. Note that detecting possible concept
drifts is insufficient for maintaining a good classification model. For instance,
consider two scenarios. In Scenario-A, the concepts X and Y interchange slowly,
while in Scenario-B X and Y interchange fast. In other words, the drift rates
are different. The consequence of ignoring such a drift rate is that the new model
may adopt the drift too quickly (too slowly) and drop the existing concepts too
early (too late). Here, we introduce the term expected drift rate.

Assume that a drift is detected at time $t_{n+1}$. The expected drift rate,
$\phi$, is measured by the slope of $e_t$:

$$\phi = \frac{e_{n+1} - \bar{e}_n}{t_{n+1} - t_n} \qquad (4)$$



Note that $\bar{e}_n$ is used instead of $e_n$. This is because taking only one previous
observation ($e_n$) may yield a very biased result. By combining $\phi$ and the current
error rate, the expected error for the next time interval is:

$$\bar{e}_{n+2} = \phi \cdot (t_{n+2} - t_{n+1}) + e_{n+1} \qquad (5)$$



4 Model Adaptation 

The key point of model adaptation is to maintain a high-quality model such that
the newly adapted model is best suited to the next incoming document.

We argue that model adaptation based on a sliding window is not the best
choice. Let us consider Figure 1. From $t_4$ to $t_7$, the users’ interest toward X (Y)
decreases (increases). At some time during this period, 50% of the documents in X should be classified as
$C_+$, whereas the other 50% should be classified as $C_-$. In such a case, the classifier
will be confused because the numbers of positive and negative examples are
the same. In this paper, we extend our previous work on a similarity-based
classifier, called discriminative category matching (DCM) [1]. Classification is
achieved by comparing the unseen document sketch with different class sketches.
DCM summarizes the information of features in a document, within a class and
across different classes.

4.1 Dynamic Sketches Constructing/Removing 

We create and maintain two class sketches for $C_+$ and $C_-$, denoted $k_+$ and $k_-$,
respectively. Let us explain dynamic sketch constructing/removing using Figure
1 as an example.

Initially, two sketches, $k_+$ and $k_-$, are created at $t_0$ for $C_+$ and $C_-$. At $t_3$,
the users’ interest begins to change. Suppose that this change is detected
at $t_4$; we then create two new sketches, $k'_+$ and $k'_-$, at $t_4$, which serve as
short-term classifiers. From then on, we maintain four sketches: $k'_+$, $k'_-$, $k_+$ and $k_-$.
During the period from $t_4$ to $t_7$, suppose that the expected drift rate at $t_i$ is $\phi_i$
for $i \in [4..6]$ ($\phi_i$ at $t_i$ can be different from $\phi_j$ at $t_j$ for $t_i \neq t_j$). According to
$\phi_i$, we can estimate the probability of user interest, $\gamma_i$, for Y in the next batch as:

$$\gamma_{i+1} = \phi_i + \gamma_i \qquad (6)$$

In other words, a fraction $\gamma$ of the documents belonging to Y should be classified as $C_+$ and the
remaining should be classified as $C_-$. The same applies to X. Finally, after $t_7$, the
error rate for $k'_+$ and $k'_-$ becomes stable and within the tolerance region and the
error rate for $k_+$ and $k_-$ increases to its maximum. Now there is no reason to keep
$k_+$ and $k_-$; they are removed, and $k'_+$ and $k'_-$ are kept for classification.

A question remaining unanswered here is: how to create short-term sketches?
For every learning algorithm, two types of examples must be provided: positive
and negative. Positive examples for $k'_+$ and $k'_-$ can be collected by choosing the
mis-classified examples from $k_+$ and $k_-$, as the mis-classified documents should be
those with changing concepts. However, we do not know the negative examples.

Our solution is as follows. First, we select features that are active in the
positive examples, i.e. whose occurrence is skewed towards positive examples. These
features are selected according to the feature strength:

^{d, i) = • (WC.,+ - wa,u) (7) 

where $WC_{i,+}$ is the weight of feature $i$ in the positive examples and $WC_{i,u}$ is
the weight of the corresponding feature in all of the documents in a batch. Refer
to [1] for the calculation of $WC$.

Table 2. 10 selected topics from the Reuters-21578 data set and the number of
documents assigned to them.

Topic      Documents   Topic         Documents
Ship       158         Acquisition   2362
Money-fx   307         Earn          3945
Ship       158         Coffee        116
Sugar      143         Crude         408
Trade      361         Interest      285

[Figure 3: three panels; axes: User Interest Level against Time]

Fig. 3. Different types of concept drift.

Furthermore, let $\Gamma(d, j)$ be a sorted list of
feature strengths for $d$. We select the top $K$ features in each positive example
and form a set of positive features. $K$ in each document may be different, and
is determined using mean square error, such that $E(\Gamma(d, j)) < E(\Gamma(d, j + 1))$
for $j < K$ and $E(\Gamma(d, K)) > E(\Gamma(d, K + 1))$, where

E{E{d,j)) = - r(d,j + 1 ) 2 ) ( 8 ) 

We repeat a similar procedure, and extract the negative examples from the
universe such that $d \cap C = \emptyset$.

5 Evaluations 

The data set used for evaluation is Reuters-21578. Table 2 shows the 10 selected
topics. All features are stemmed and converted to lower case. Four measures
are used for evaluating the performance of the classifier: 1) Precision ($\pi$); 2)
Recall ($\rho$); 3) Error rate ($e$); and 4) $F_1$. Following the existing works [3,4,7],
we simulate three drifting scenarios between two topics: Earn and Acquisition.
All other topics serve as noise, and are not relevant to the user interest. In
each scenario, documents from each topic are evenly distributed into 21 batches
(batch 0 to batch 20). Batch 0 serves as the initial training set, and the remaining
20 batches represent the temporal development:

— Scenario A (Figure 3(a)): From batch 0 to batch 8, documents from Earn
are relevant with respect to the user interest. This changes gradually from
batch 9 to batch 12, and finally only Acquisition is relevant beginning from
batch 13.

— Scenario B (Figure 3(b)): A new relevant topic arises gradually while
another relevant topic is always present. From batch 0 to batch 8, only Earn
is relevant. Acquisition is added to the relevant text starting from batch 9
to batch 12.

— Scenario C (Figure 3(c)): The relevant topic is split into two, and one
of them gradually becomes non-relevant. Initially, Earn and Acquisition
are considered as relevant. Starting from batch 9, Acquisition gradually
becomes non-relevant.

Fig. 4. Results for Exp 1

(a) Scenario A (b) Scenario B (c) Scenario C

Fig. 5. Results for Exp 2



5.1 A Comparison of Different Approaches 

Four experiments are conducted for comparison: 

— Exp 1 (Only-Once): The classifier is learned only from batch 0, i.e. the
classifier will not be updated.

— Exp 2 (Incremental): The classifier is learned throughout all of the
batches, i.e. at batch $i$, the classifier is learned from batch 0 to batch $(i - 1)$,
and batch $i$ is used for testing.

— Exp 3 (Based on Concept Drifts): The classifier is constructed in the
same way as discussed in this paper.
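The first two protocols differ only in which batches are available for training when batch $i$ is tested. A minimal sketch, assuming the stream is held as a list of batches with batch 0 first; the function names are hypothetical, and the classifier itself is left out.

```python
def only_once_splits(batches):
    """Exp 1: always train on batch 0, test on each later batch in turn."""
    for i in range(1, len(batches)):
        yield batches[:1], batches[i]


def incremental_splits(batches):
    """Exp 2: train on batches 0..i-1, test on batch i."""
    for i in range(1, len(batches)):
        yield batches[:i], batches[i]
```

With 21 batches both generators yield 20 train/test pairs, one for each of batches 1 to 20.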

In Exp 1 (Figure 4), the performance of only-once classification is initially good
in all scenarios. However, whenever a drift occurs, its performance deteriorates
significantly. Scenario B shows that the measurement “precision” cannot detect
the concept drifts, while Scenario C shows that the measurement “recall” fails
to do so. In Exp 2 (Figure 5), the performance of incremental classification in
all scenarios is slightly better than that of only-once classification (Exp 1), but
still unacceptable. Exp 3 (Figure 6) performs the best.

(a) Scenario A (b) Scenario B (c) Scenario C

Fig. 6. Results for Exp 3
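
Why precision is blind to the drift in Scenario B while recall is blind to it in Scenario C can be seen from a small worked example. The sketch below uses the standard definitions of the four measures; the document counts (100 Earn, 100 Acquisition and 300 noise documents per batch) and the assumption of an idealized classifier that is never updated are hypothetical and chosen only for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, error rate and F1 from the counts of a contingency table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    error = (fp + fn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, error, f1


# Scenario B after the drift: the frozen classifier still accepts only Earn,
# so the 100 Acquisition documents that are now relevant are all missed.
scenario_b = metrics(tp=100, fp=0, fn=100, tn=300)

# Scenario C after the drift: the frozen classifier still accepts Earn and
# Acquisition, so the 100 Acquisition documents become false positives.
scenario_c = metrics(tp=100, fp=100, fn=0, tn=300)

if __name__ == "__main__":
    print("B: p=%.2f r=%.2f e=%.2f F1=%.2f" % scenario_b)
    print("C: p=%.2f r=%.2f e=%.2f F1=%.2f" % scenario_c)
```

Under these assumptions precision stays at 1 in Scenario B and recall stays at 1 in Scenario C, whereas the error rate and F1 change in both cases.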



6 Detailed Analysis 



In this section, three situations are tested: 1) Varying the amount of noise; 2)
Varying the drift rate; and 3) Varying the extent of drift. Due to the space
limitation, only Scenario A is demonstrated.



6.1 Varying the Amount of Noise 

Two situations are simulated: 

— Case-1: a% of the target concepts, Earn and Acquisition, will be used
randomly. For example, in the extreme case, when a = 100, completely
random data will be generated. Here, three values of a are chosen: 10, 20 and 30.

— Case-2: Some topics are added and/or removed. Three situations are mod- 
eled: 1) Last five topics from Table 2 are removed. 2) Additional topics are 
added to the non-interesting event on top of the first situation. 3) Additional 
five topics are added to the relevant event on top of the second situation. The 
topics are chosen randomly. This case aims at determining the effectiveness 
of the statistical quality control. 
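
One possible reading of Case-1 is that a% of the documents of the target concepts receive a random relevance label. The sketch below follows that reading, which is an assumption; the function name and the seed parameter are hypothetical.

```python
import random


def add_label_noise(labels, a, seed=0):
    """Give a% of the documents a random relevant/non-relevant label."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = round(len(noisy) * a / 100)
    # Pick the affected documents without replacement, then relabel them at random.
    for idx in rng.sample(range(len(noisy)), n_noisy):
        noisy[idx] = rng.choice([True, False])
    return noisy


if __name__ == "__main__":
    clean = [True] * 50 + [False] * 50
    for a in (10, 20, 30):
        noisy = add_label_noise(clean, a)
        print(a, sum(x != y for x, y in zip(clean, noisy)))
```

With a = 100 every label is drawn at random, which corresponds to the extreme case mentioned above.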

Figure 7 and Figure 8 show the results for Case-1 and Case-2, respectively.
In Figure 7, the general shape deteriorates from (a) to (c). This is expected as
more noise is present. Note that immediately after the drifts, all situations return to
their corresponding stable stage quickly. This suggests that the proposed drift
rate is useful.



Classifying Text Streams in the Presence of Concept Drifts 381 




(a) a = 10% (b) a = 20% (c) a = 30% 

Fig. 7. Results of varying noise for Case-1 




(a) Situation 1 (b) Situation 2 (c) Situation 3 



Fig. 8. Results for varying noise for Case-2 

6.2 Varying the Drift Rate 

Sometimes concepts will change gradually, creating a period of uncertainty
between stable states. In other words, the new concept only gradually takes over.
Consider Figure 2 again. By varying Δt, we can simulate the situation of varying
the drift rate. Three drift rates are chosen: 1) Δt = 0 (abrupt drift); 2) Δt = 1
(fast drift); and 3) Δt = 8 (slow drift). All of these units represent the number
of batches. The concept drifts here all begin at batch 8.

Figure 9 shows the results of varying the drift rate. For Δt = 0, the quality
of the classifier drops to its minimum in batch 9, when the concept begins to drift.
When Δt = 0, the concept does not drift but shifts. In this situation, we do not
have any previous knowledge to predict when a shift will happen. However, the
classifier can adapt to the new concept quickly in the next batch. This is due to the
drift rate proposed in Section 3.

6.3 Varying the Extent of Drift

Computational learning theory quantifies the extent of drift as the relative error
between two concepts, which is the probability that a randomly drawn example
which should be classified as A is misclassified as B. Thus, the extent of drift
is the dissimilarity between two successive concepts A and B.
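
As a rough illustration, the extent of drift between two concepts can be estimated empirically as the fraction of sampled examples on which the two concepts disagree. This is a sketch under that simplifying reading of the definition; the function name and the toy sample are hypothetical.

```python
def drift_extent(concept_a, concept_b, examples):
    """Fraction of the sampled examples that the two concepts label differently."""
    examples = list(examples)
    if not examples:
        raise ValueError("at least one example is required")
    disagreements = sum(concept_a(x) != concept_b(x) for x in examples)
    return disagreements / len(examples)


if __name__ == "__main__":
    # Toy sample: each document is represented only by its topic.
    sample = ["earn"] * 3 + ["acq"] * 2 + ["crude"] * 5
    earn_is_relevant = lambda topic: topic == "earn"
    acq_is_relevant = lambda topic: topic == "acq"
    print(drift_extent(earn_is_relevant, acq_is_relevant, sample))
```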

In general, such an experiment is not easy to conduct, because there is no
benchmark that denotes the similarity between two text categories. However,
many researchers have pointed out that some Reuters categories are “difficult
categories” [13,2,9]. [9] shows that Crude and Trade are more difficult to
classify correctly than Earn and Acquisition. We conduct an experiment
using the two “difficult categories”, Crude and Trade. Figure 10 shows the
results. As expected, the quality decreases further. However, the detection of
concept drifts does not seem to be affected.

(a) Δt = 0 (b) Δt = 1 (c) Δt = 8

Fig. 9. Results for varying the rate of concept drift

Fig. 10. Results for varying the extent of concept drift

7 Conclusion 

In this paper, we presented an effective approach for classifying text streams 
in the presence of concept drifts. We plan to study how to deal with recurring 
concepts. In the case where a previously existing concept re-appears, it is a waste 
of time to re-learn the characteristics of this concept from scratch. It is possible 
to maintain the potentially useful class sketches and re-examine at a later stage 
for faster convergence. 

Abstract. We propose a method of selecting initial training examples 
for active learning so that it can reach high performance faster with fewer 
further queries. Our method divides the unlabeled examples into clusters 
of similar ones and then selects from each cluster the most representative 
example which is the one closest to the cluster’s centroid. These repre- 
sentative examples are labeled by the user and become the members of 
the initial training set. We also promote inclusion of what we call model 
examples in the initial training set. Although the model examples which 
are in fact the centroids of the clusters are not real examples, their con- 
tribution to enhancement of classification accuracy is significant because 
they represent a group of similar examples so well. Experiments with 
various text data sets have shown that the active learner starting from
the initial training set selected by our method reaches higher accuracy
faster than one starting from a randomly generated initial training set.



1 Introduction 

For a successful application of machine learning to classification tasks, we usually 
need to provide the learner with a sufficiently large number of labeled examples. 
If a large number of examples are available, however, obtaining the labels for 
these data often takes a lot of effort and time depending on the application area. 
Active learning [1,2] copes with this problem by carefully selecting examples to 
be labeled and trying to achieve the best possible performance using as few such 
labeled examples as possible, with the learning and querying stages repeated 
alternately. The focus of previous works on active learning was mainly on what 
and how to select the next example to query for its label, while not much at- 
tention has been given to the issue of selecting the initial training set. Given 
a better initial training set, we expect that the active learner can reach high 
performance faster with fewer further queries. 

It is also worth pointing out that asking for category information for a single 
example at each querying stage could be practically inefficient. The user might 
show better performance at categorizing examples when not just one but multiple
examples are given simultaneously because he or she can compare different
examples freely and work on them in any flexible order and thus becomes able 
to assign labels more precisely in a more time efficient manner. Especially when 
we have available enough manpower to answer the queries we can save time by 
having multiple examples labeled in parallel. Considering that active learning 
aims at achieving high performance using limited resources of time and manual 
efforts, selection of multiple examples for initial training seems desirable. 

In this paper, we propose a method of selecting initial training examples for 
the purpose of enhancing the performance of active learning. When the number 
of labeled examples to be used for training is severely limited as is the case in the 
setting of active learning, it is advantageous not to have too similar examples 
to be included together in the training set. Our method therefore first divides 
the given unlabeled examples into clusters of similar ones and then selects from 
each cluster the most representative one to be labeled. In our empirical study, 
we applied the k-means algorithm to cluster text data and selected from each cluster
the document closest to the centroid as the cluster’s representative. The selected
documents were then labeled to form an initial training set. Experiments with
this training set showed much better learning curves than those with a randomly
selected initial training set.

2 Cluster-Based Sampling to Select Initial Training Set 

Our method of selecting examples for an initial training set starts by clustering
the unlabeled examples by applying a k-means algorithm. Suppose we want to
select two examples to form an initial training set. We first select two examples
randomly. However, instead of using these examples directly as an initial training
set (by querying the user for their labels), our method takes them as initial
seeds to group all the unlabeled examples into two clusters. Once the k-means
algorithm converges to two final clusters after some iterations, a representative
example is selected from each cluster as shown in Fig. 1(a). Since each cluster 
consists of similar examples which are likely to belong to the same category, it 
seems to be a good idea to pick a representative example from each cluster to 
form an initial training set. As the best candidate for a representative of each 
cluster, one might easily consider the centroid. However, direct labeling of the 
centroids may be difficult because they are not real existing examples. When 
working in the domain of text classification, for example, there is no document 
corresponding to a centroid. We therefore take the example closest to the centroid 
as the representative and ask the user for its category. In fact, the category thus 
obtained can also be assigned to the centroid itself because the two are so close to 
each other. The representatives, after being labeled, optionally together with the 
correspondingly labeled centroids, constitute an initial training set. The classifier 
learned from such an initial training set is shown in Fig. 1(b). The centroids, 
when they are labeled (at no cost of making extra query) and used for training 
examples, are called model examples. We will later show that inclusion of the
model examples in the initial training set results in even more enhancement of
learning performance.

(a) Selecting initial training examples from stabilized clusters
(b) A class boundary learned with an initial training set (with two cluster-basis examples)

Fig. 1. A result of learning with an initial training set selected by k-means clustering
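
The selection procedure described in this section can be sketched as follows. This is a minimal illustration in plain Python and not the implementation used in our experiments: it assumes examples are numeric vectors compared by squared Euclidean distance (text data would typically use a different similarity), and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means seeded with k randomly chosen examples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # clusters have stabilized
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return centroids, assignment


def select_initial_examples(points, k, seed=0):
    """Return (index of representative example, centroid) for each non-empty cluster."""
    centroids, assignment = kmeans(points, k, seed)
    selected = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            rep = min(members, key=lambda i: sq_dist(points[i], centroids[c]))
            selected.append((rep, centroids[c]))
    return selected


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11),
            (0.5, 0.6), (10.5, 10.4)]
    for rep, centroid in select_initial_examples(data, k=2):
        print("query the label of example", rep, "and reuse it for centroid", centroid)
```

Each returned index identifies an example to be labeled by the user; the label obtained for it can then be copied to the paired centroid to form a model example at no extra query cost.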

3 Experimental Results 

We have tested our cluster-based sampling method on several text classification 
tasks to form initial training set for active learning, using the two well-known 
text data sets Reuters-21578 and Newsgroups-20 [3]. From Reuters-21578, we 
selected only those articles which belong to one of the two most common topics 
(earn and acq). Newsgroups-20 consists of about 20,000 articles that had been 
posted to 20 different newsgroups. To see the effect of our method on tasks 
of different difficulty levels, we generated from Newsgroups-20 two data sets:
Different-3 and Same-3, as also used and described in [4]. Different-3 consists
of articles which were posted to one of three very different newsgroups,
i.e., alt.atheism, rec.sport.baseball, and sci.space. The data set Same-3
consists of articles from three very similar newsgroups, i.e., comp.graphics,
comp.os.ms-windows, and comp.windows.x. Each article was converted to a
final form through the process of stemming and removing USENET headers
except the subject (for Different-3 and Same-3), SGML tags (for Reuters),
and stop-words. For classification tasks, we applied the k-NN (k-nearest neighbor)
algorithm, which is known to work very well for text classification.

We implemented and compared three different strategies for selecting an 
initial training set from a bigger set of unlabeled examples. Random selects initial 
training examples just randomly. The other two strategies both start by selecting 
random seeds and applying a k-means algorithm to cluster the given data before 
forming an initial training set. While KMeans samples only one representative 
example from each cluster to form an initial training set, KMeans+ME collects 
the model example of each cluster in addition to the representative example. 
Ten-fold cross-validation was repeated ten times to compare the effects of 
the differently generated initial training sets. 

Figure 2 shows accuracies of the classifiers learned by using the initial training
sets generated by the three strategies, with the sizes of the initial training sets
varied from 10 up to 50. The horizontal axis of each graph indicates the number
of training examples used. The vertical axis of each graph is the accuracy of
the classifier. The value of k for the k-NN algorithm was set to 5. Both the classifiers
obtained by using our methods KMeans and KMeans+ME achieved higher
accuracies than that by Random for all the three data sets, whether they are easy
or hard. In Fig. 2, we can see that the ten examples selected by KMeans+ME
were more valuable for learning than the fifty examples selected by Random.
These results show how important it is to select good examples for training, and
also show that our strategies are effective for the selection work. Notice
that the model examples have a positive effect on learning.

Fig. 2. Accuracies of classifiers with various sizes of initial training set

Fig. 3. Accuracies of active learner starting with 10 or 30 initial training examples



Figure 3 shows learning curves of the active learner when started from 10 
or 30 initial training examples. The active learner was allowed to query for up 
to 50 examples including the given initial training set. The active learner used 
in this experiment selected the most uncertain example for the next query. The 
parameter k was fixed to 5 as in the previous experiments. We can easily see 
that our methods consistently outperform Random, especially when the size of 
the initial training set is relatively large. These results reveal that the quality
of the initial training set has a significant influence on the subsequent
performance of the active learner.
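
The query step of such an active learner can be sketched as below. The exact uncertainty measure is not specified above, so this sketch assumes one common choice for k-NN: the example with the smallest margin between the two most frequent labels among its k nearest labeled neighbors is queried next. The function names and the toy data are hypothetical, and squared Euclidean distance is used only for illustration.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_votes(x, labeled, k=5):
    """Label counts among the k labeled examples nearest to x."""
    nearest = sorted(labeled, key=lambda ex: sq_dist(x, ex[0]))[:k]
    return Counter(label for _, label in nearest)


def most_uncertain(unlabeled, labeled, k=5):
    """Index of the unlabeled example with the smallest top-two vote margin."""
    def margin(x):
        counts = knn_votes(x, labeled, k).most_common(2)
        second = counts[1][1] if len(counts) > 1 else 0
        return counts[0][1] - second
    return min(range(len(unlabeled)), key=lambda i: margin(unlabeled[i]))


if __name__ == "__main__":
    labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
               ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b")]
    unlabeled = [(0.2, 0.2), (5, 5.2), (10.5, 10.5)]
    # k=3 only because this toy labeled set is tiny; the experiments use k=5.
    print(most_uncertain(unlabeled, labeled, k=3))
```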

4 Related Works 

The main query selection strategy in active learning is uncertainty sampling [1] 
which selects the most ambiguous example for the next query. Roy and McCal- 
lum [2] argued that the best example for query is the one which, if labeled, can 
be used in the next learning stage to derive a classifier of minimum expected 
loss. The focus of these works was mainly on what and how to select an example 
for the next query after the initial training. 

The text bundling technique recently proposed by Shih et al. [5] is a new 
method for condensing the text data. Their method divides texts belonging to 
the same category into groups of similar ones and then averages each group to 
generate a bundled text. They showed that text bundling is useful especially 
when we solve text classification tasks by using a support vector machine which 
is a highly accurate but time-consuming learning algorithm. The role of bundled 
texts is very similar to that of our model examples in that they help the learner 
to derive classifiers of higher classification accuracy. However, text bundling re- 
quires examples which are already labeled. 

5 Conclusions and Future Work 

In this paper, we presented a method of selecting initial training examples for 
active learning so that it can reach high performance faster with fewer further 
queries. The core of our method is the selection of representative and model 
examples from the given unlabeled data after dividing the data into clusters 
of similar examples. Experiments with various text data sets have shown that 
the active learner starting from the initial training set selected by our method 
reaches higher accuracy faster than one starting from a randomly selected initial
training set.

As future work, we will try to verify our method by applying it 
to various other classification tasks. We also plan to find a way to generalize our 
method to incorporate other clustering algorithms such as the EM algorithm. 

Abstract. Clustering of natural text collections is generally difficult due 
to the high dimensionality, heterogeneity, and large size of text collec- 
tions. These characteristics compound the problem of determining the 
appropriate similarity space for clustering algorithms. In this paper, we 
propose to use the spectral analysis of the similarity space of a text 
collection to predict clustering behavior before actual clustering is per- 
formed. Spectral analysis is a technique that has been adopted across 
different domains to analyze the key encoding information of a system. 
Spectral analysis for prediction is useful in first determining the quality 
of the similarity space and discovering any possible problems the selected 
feature set may present. Our experiments showed that such insights can 
be obtained by analyzing the spectrum of the similarity matrix of a text 
collection. We showed that spectrum analysis can be used to estimate 
the number of clusters in advance. 



1 Introduction 

With the rapid growth of the World Wide Web, there is an abundance of text
information available to general users. Text collections on the Web are more
heterogeneous and exhibit more frequent changes in terms of content, size, and
features than classical text collections. In clustering, one common preprocessing step is to capture
or predict the characteristic of the target text collection before the clustering al- 
gorithm is applied. In this paper, we refer to this characteristic as the clustering 
behavior of a text collection. For instance, when clustering Web text collections 
that are heterogeneous and dynamic (frequent updates), prior knowledge of their 
characteristic is helpful in terms of determining the factors that may affect the 
clustering algorithm. In general, prediction can be thought of as a snapshot (or 
rough view) of a text collection. In this sense, we refer to such a characteristic as 
the fingerprint of the target text collection. In this paper, our work is motivated 
by the investigation of the following two facts: 

— Function of similarity. Clustering is described or defined by many re- 
searchers as a process based on the similarity among objects. In most cases, 
when the similarities among documents have been computed, the clustering 
results are also largely determined. Such pairwise similarity among documents
in a text collection can be recorded in a similarity matrix. Hence, the
similarity matrix encodes the clustering characteristics of a text collection. 
Based on the function of similarity, it is feasible to establish the fingerprint 
of a text collection. 

— Function of domain knowledge. The selection of feature sets, the feature 
weighting scheme, and the similarity measure can be viewed as the implicit 
or explicit domain knowledge in a clustering process [5]. For example, based 
on one’s experience, one may select a set of meaningful terms or phrases 
to construct the feature set for a text collection. This feature set would be 
more effective than other feature sets that are computed mechanically. The 
effectiveness of a feature set determines the efficacy of the similarity matrix 
in terms of revealing the expected relationships of documents [4,7,9]. 

The above two observations serve as guiding principles for our work in this
paper, which aims to approach text clustering from another perspective. In this
paper, we propose the use of the normalized spectrum (the set of eigenvalues)
of a similarity matrix as the fingerprint of a text collection. We report
observations on the normalized spectrum of a target text collection that
establish the relationship between the similarity spectrum and the clustering
behavior of the text collection.

2 Spectral Analysis of Similarity Matrix 

2.1 Weighted Undirected Graph and Variant of Weighted Laplacian 

Given the similarity matrix $S = (s_{ij})_{n \times n}$, we define $G(S) = (V, E, S)$ as its
associated graph, where $V$ is the set of $n$ vertices and $E$ is the set of weighted
edges. In this graph, each vertex $v_i$ corresponds to the $i$-th column (or row) and
the weight of each edge $v_i v_j$ corresponds to the non-diagonal entry $s_{ij}$. Therefore,
we have established a mapping between $S$ and $G(S)$. This mapping binds the
clustering behaviors of $S$ with the properties and structure of $G(S)$.

Due to the different scales of the spectra of different $S$'s, it is difficult to
analyze and compare them. Thus, we transform $S$ to a weighted Laplacian
$L = (l_{ij})$, which has “normalized” eigenvalues with the same scale [1]. The eigenvalues
of $L$ lie in the interval $[0, 2]$. For ease of analysis and computation, we use $\tilde{L}$, a
variant of the weighted Laplacian, instead of $L$, which is defined as:

\[ \tilde{L} = D^{-1/2} (S - I) D^{-1/2} \tag{1} \]

where $D = \mathrm{diag}(d_i)$ denotes the diagonal matrix ($d_i = \sum_j s_{ij}$ is the degree of
the vertex $v_i$ in graph $G(S)$). From the definition of $L$ [1] and Equation 1, we
deduce $\mathrm{eig}(\tilde{L}) = \{1 - \lambda \mid \lambda \in \mathrm{eig}(L)\}$ (where $\mathrm{eig}(\cdot)$ represents the set of eigenvalues
of a matrix). Hence, the eigenvalues of $\tilde{L}$ are in the interval $[-1, 1]$ and all the
conclusions and properties of the weighted Laplacian $L$ are also applicable to $\tilde{L}$.
In this paper, we use the eigenvalues of a variant of the weighted Laplacian, $\mathrm{eig}(\tilde{L})$,
to perform the spectral analysis of the similarity matrix $S$. Given the similarity
matrix $S$, let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $G(S)$ in non-increasing
order. There are some basic facts about $G(S)$: (1) $\sum_i \lambda_i = 0$; (2) $-1 \le \lambda_i \le 1$,
$i = 1, \ldots, n$; (3) $\lambda_1 = 1$. Proofs of these facts are not given as they can be
found in spectral graph theory [1,2].
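
Facts (1) and (3) can be checked numerically on a small example. The sketch below is only an illustration and rests on two assumptions: the variant Laplacian is taken to be $D^{-1/2}(S - I)D^{-1/2}$ for a symmetric $S$ with unit diagonal, and the degree $d_i$ is taken to be the total weight of the edges at vertex $v_i$, i.e. the sum of the off-diagonal entries of row $i$. The function name and the toy matrix are hypothetical.

```python
import math


def variant_laplacian(S):
    """Return (L_tilde, d) with L_tilde = D^{-1/2} (S - I) D^{-1/2}.

    Assumes S is symmetric with unit diagonal and every vertex has positive degree.
    """
    n = len(S)
    # S - I: only the non-diagonal entries carry edge weights.
    W = [[0.0 if i == j else S[i][j] for j in range(n)] for i in range(n)]
    d = [sum(row) for row in W]  # degree of each vertex (assumed off-diagonal sum)
    L = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    return L, d


if __name__ == "__main__":
    S = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.2],
         [0.1, 0.2, 1.0]]
    L, d = variant_laplacian(S)
    n = len(S)
    # Fact (1): the eigenvalues sum to the trace, which is zero.
    print("trace:", sum(L[i][i] for i in range(n)))
    # Fact (3): v = D^{1/2} 1 satisfies L v = v, so 1 is an eigenvalue.
    v = [math.sqrt(x) for x in d]
    Lv = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
    print("max |Lv - v|:", max(abs(a - b) for a, b in zip(Lv, v)))
```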

Suppose $h$ is the number of non-zeros in $S$; then the computational complexity
of this transformation is $O(h)$. The complexity of our method is $O(k(h + n))$
using the Lanczos method [2].



2.2 Some Observations about the $G(S)$ Spectrum 

Two observations on the $G(S)$ spectrum are given that relate the clustering
behavior of $S$ to the principal properties and structure of $G(S)$. Due to the
limitation of space, intuitive and theoretical explanations for these
observations are not presented in this paper.

Observation 1. Suppose the similarity matrix $S$ has the block structure:

\[ S = \begin{pmatrix} S_{11} & \cdots & S_{1k} \\ \vdots & \ddots & \vdots \\ S_{k1} & \cdots & S_{kk} \end{pmatrix} \tag{2} \]

where the block rows and block columns have sizes $n_1, \ldots, n_k$ with
$n_1 + \cdots + n_k = n$. Each diagonal block matrix $S_{ii}$ satisfies
$0 \le n_i - \|S_{ii}\|_F \to 0$, and each non-diagonal block matrix $S_{ij}$ satisfies
$\|S_{ij}\|_F \to 0$ ($\|\cdot\|_F$ is the Frobenius norm). Then $S$ shows a good clustering behavior
with $k$ clusters. The spectrum of $G(S)$ is

\[ \lambda_i \to 1 \ (i = 1, \ldots, k, \text{ and } 0 < \lambda_i \le 1), \qquad |\lambda_i| \to 0 \ (i = k+1, \ldots, n) \tag{3} \]

In this observation, the above two conditions guarantee the extremely
good clustering behavior of $S$, with intra-similarities approaching 1 and
inter-similarities approaching 0 among clusters. The spectrum of $G(S)$ shows the
typical distribution in (3). However, for text collections in the real world, the
spectrum of $G(S)$ shows a complex distribution that is greatly different from that of the
extreme case. The following observation concerns large text collections.

Observation 2. Suppose $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ is the $G(S)$ spectrum. Then we
have:

(1) The distribution of the spectrum exhibits a polynomial of high odd order, and
most of the eigenvalues are centered in the near neighborhood of 0.

(2) If $\lambda_2$ is higher, there exists a better bipartition for $S$.

(3) For the sequence $\alpha_i = \lambda_i / \lambda_2$ ($i \ge 2$), if there exists $k \ge 2$ such that
$\alpha_k \to 1$ and $\alpha_k - \alpha_{k+1} > \delta$ ($0 < \delta < 1$), then $k$ indicates the cluster
number of the text collection. (Here, $\delta$ is a predefined threshold to measure
the first large gap between the $\alpha_i$.)

(4) If the curve of the spectrum is closer to the x-axis, the clustering behavior
of $S$ is worse. (This closeness can be measured by

Fig. 1. Part of Spectrum and Gray Scale Images on Three Text Collections

3 Experimental Results

Three text collections were used in experiments: (1) “WEB2” is from two cat- 
egories in Yahoo with each category containing 600 documents [8]. (2) “CLAS- 
SIC4” is from four classes of traditional text collections in IR area (CACM/CISI/ 
CRAN/MED) with each class containing 1,000 documents. (3) “NEWS5” is from 
five newsgroups of UseNet news postings with each newsgroup containing 500 
documents. In this experiment, we investigated the prediction of cluster number 
to test the viability of spectral analysis on text collections. 

Spectral Analysis. Specifically, we estimated the number of clusters in each text
collection by Observation 2(3). The analysis starts from $\lambda_2$, as $\lambda_1$ is always 1.
From Figure 1, we found and circled the largest eigenvalues with the largest
gaps. They can be observed as the stair steps among the eigenvalues in Figure 1.
According to Observation 2(3), the number of these large eigenvalues (including
$\lambda_1$) with wide gaps indicates the number of clusters. We can verify it by analyzing
their gray scale images and class topics.
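
The gap-based estimate can be sketched in a few lines. The code below is only an illustration: it assumes the ratio sequence $\alpha_i = \lambda_i / \lambda_2$ and returns the first $k \ge 2$ whose gap $\alpha_k - \alpha_{k+1}$ exceeds the threshold. The function name, the threshold value 0.3 and the example spectra are hypothetical and are not taken from the experiments.

```python
def estimate_cluster_number(eigenvalues, delta=0.3):
    """Return the first k >= 2 with alpha_k - alpha_{k+1} > delta, or None if no gap is found.

    Assumes alpha_i = lambda_i / lambda_2 for i >= 2 and that lambda_2 > 0.
    """
    lam = sorted(eigenvalues, reverse=True)  # lambda_1 >= lambda_2 >= ...
    if len(lam) < 3 or lam[1] <= 0:
        raise ValueError("need at least three eigenvalues and a positive lambda_2")
    alpha = [x / lam[1] for x in lam[1:]]    # alpha[0] corresponds to alpha_2
    for offset in range(len(alpha) - 1):
        if alpha[offset] - alpha[offset + 1] > delta:
            return offset + 2                # alpha[offset] is alpha_k with k = offset + 2
    return None


if __name__ == "__main__":
    # Hypothetical leading eigenvalues: one large lambda_2, then a smooth tail.
    print(estimate_cluster_number([1.0, 0.70, 0.15, 0.12, 0.10]))
    # Hypothetical leading eigenvalues: lambda_2..lambda_4 large, then a drop.
    print(estimate_cluster_number([1.0, 0.62, 0.58, 0.55, 0.12, 0.10, 0.08]))
```

In practice only the leading eigenvalues are needed, so the input can be the output of a partial eigensolver such as the Lanczos method.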

WEB2 has two completely different topics, which can be seen in its gray
scale image (Figure 1(a)). In the spectrum of its similarity matrix
(Figure 1(a)), $\lambda_2$ has a higher value than the other eigenvalues. The rest of the
eigenvalues fall along a smooth gentle line rather than a steep line. This phenomenon
conforms to Observation 2(3). Hence, in this case, $k = 2$. For CLASSIC4, a
similar spectral analysis can be made from its spectrum (Figure 1(b)). The result
shows that this collection has $k = 4$ clusters. Unlike the previous two collections
with disjoint topics, NEWS5 has five overlapping topics: “atheism”, “comp.sys”,
“comp.windows”, “misc.forsale” and “rec.sport”. They are exhibited by the gray
scale image (Figure 1(c)). $\lambda_2$, $\lambda_3$, and $\lambda_4$ have higher values and wider gaps than
the other eigenvalues. This indicates that there are $k = 4$ clusters in this collection.
The topics “comp.sys” and “comp.windows” can be viewed as one topic, “comp”,
which differs more from the other three topics. From this point of view, the
estimation of four clusters is a reasonable result.

Comparison with Existing Methods. Traditional methods for estimating
the number of clusters typically run a clustering algorithm many times with
predefined cluster numbers $k$ from 2 to $k_{max}$. Then the optimum $k$ is obtained
by an internal index based on those clustering results. These methods differ in the
index they use. In this experiment, we computed three widely used statistical
indices [3]: Calinski and Harabasz (CH), Krzanowski and Lai (KL) and Hartigan
(Hart), based on three clustering algorithms [6]. In Table 1, our method remarkably
outperforms these three methods. This may be due to their dependence on the
specific clustering algorithms and their ineffectiveness for high-dimensional data.
In contrast, our method is independent of the specific clustering algorithms, is
suitable for high-dimensional data sets, and has low complexity.

Abstract. Traditional clustering algorithms are based on one representation
space, usually a vector space. However, in a variety of modern
applications, multiple representations exist for each object. Molecules 
for example are characterized by an amino acid sequence, a secondary 
structure and a 3D representation. In this paper, we present an efficient 
density-based approach to cluster such multi-represented data, taking 
all available representations into account. We propose two different tech- 
niques to combine the information of all available representations depen- 
dent on the application. The evaluation part shows that our approach is 
superior to existing techniques. 



1 Introduction 

In recent years, the research community has paid a lot of attention to clustering,
resulting in a large variety of different clustering algorithms [1]. However, all
those methods are based on one representation space, usually a vector space
of features and a corresponding distance measure. But for a variety of modern
applications such as biomolecular data, CAD parts or multi-media files mined
from the internet, it is problematic to find a common feature space that incorpo- 
rates all given information. Molecules like proteins are characterized by an amino 
acid sequence, a secondary structure and a 3D representation. Additionally, pro- 
tein databases such as Swissprot [2] provide meaningful text descriptions of the 
stored proteins. In CAD-catalogues, the parts are represented by some kind of 
3D model like Bezier curves, voxels or polygon meshes and additional textual 
information like descriptions of technical and economical key data. We call this 
kind of data multi-represented data, since any data object might provide several 
different representations that may be used to analyze it. 

Clustering multi-represented data with the established clustering methods
would require restricting the analysis to a single representation or constructing a
feature space comprising all representations. However, the restriction to a single
feature space would not consider all available information, and the construction
of a combined feature space demands great care when constructing a combined
distance function.

In this paper, we propose a method to integrate multiple representations di- 
rectly into the clustering algorithm. Our method is based on the density-based 
clustering algorithm DBSCAN [3] that provides several advantages over other 
algorithms, especially when analyzing noisy data. Since our method employs a 
separated feature space for each representation, it is not necessary to design a 
new suitable distance measure for each new application. Additionally, the han- 
dling of objects that do not provide all possible representations is integrated 
naturally without defining dummy values to compensate for the missing repre- 
sentations. Last but not least, our method does not require a combined index 
structure, but benefits from each index that is provided for a single representa- 
tion. Thus, it is possible to employ highly specialized index structures and filters 
for each representation. We evaluate our method for two example applications. 
The first is a data set consisting of protein sequences and text descriptions. Ad- 
ditionally, we applied our method to the clustering of images retrieved from the 
internet. For this second data set, we employed two different similarity models. 

The rest of the paper is organized as follows. After this introduction, we 
present related work. Section 3 formalizes the problem and introduces our new 
clustering method. In our experimental evaluation that is given in section 4, 
we introduce a new quality measure to judge the quality of a clustering with 
respect to a reference clustering and display the results achieved by our method 
in comparison with the other mentioned approaches. The last section summarizes 
the paper and presents some ideas for future research. 



2 Related Work 

There are several problems that are closely related to the clustering of
multi-represented data. Data mining of multi-instance objects [4] is based on the
precondition that each data object might be represented by more than one instance
in a common data space. However, all instances that are employed are elements
of the same data space, and multi-instance objects were predominantly treated
with respect to classification, not clustering.

A similar setting to the clustering of multi-represented objects is the clustering
of heterogeneous or multi-typed objects [5,6] in web mining. In this setting,
there are also multiple databases, each yielding objects in a separate data space.
Each object within these data spaces may be related to an arbitrary number of
data objects within the other data spaces. The framework of reinforcement
clustering employs an iterative process based on an arbitrary clustering algorithm.
It clusters one dedicated data space while employing the other data spaces for
additional information. It is also applicable to multi-represented objects. However,
due to its dependency on the data space for which the clustering is started,
it is not well suited to solve our task. Since, to the best of our knowledge,
reinforcement clustering is the only other clustering algorithm directly applicable
to multi-represented objects, we use it for comparison in our evaluation section.

Our approach is based on the formal definitions of density-connected sets
underlying the algorithm DBSCAN [3]. Based on two input parameters ($\varepsilon$ and
$k$), DBSCAN defines dense regions by means of core objects. An object $o \in DB$
is called a core object if its $\varepsilon$-neighborhood contains at least $k$ objects.
Usually, clusters contain several core objects located inside a cluster and border
objects located at the border of the cluster. In addition, objects within a cluster
must be “density-connected”. DBSCAN is able to detect arbitrarily shaped clusters
in a single pass over the data. To do so, DBSCAN uses the fact that a
density-connected cluster can be detected by finding one of its core objects $o$
and computing all objects which are density-reachable from $o$. The correctness
of DBSCAN can be formally proven (cf. lemmata 1 and 2 in [3], proofs in [7]).



3 Clustering Multi-represented Objects 

Let $DB$ be a database consisting of $n$ objects. Let $R := \{R_1, \ldots, R_m\}$ be the
set of different representations existing for objects in $DB$. Each object $o \in DB$
is therefore described by at most $m$ different representations, i.e.
$o := \{R_1(o), R_2(o), \ldots, R_m(o)\}$. If all different representations exist for $o$,
then $|o| = m$, else $|o| < m$. The distance function is denoted by $dist$. We assume
that $dist$ is symmetric and reflexive. In the following, we call the
$\varepsilon_i$-neighborhood of an object $o$ in one special representation $R_i$ its
local $\varepsilon$-neighborhood w.r.t. $R_i$.

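Definition 1 (local $\varepsilon_i$-neighborhood w.r.t. $R_i$).

Let $o \in DB$, $\varepsilon_i \in \mathbb{R}^+$, $R_i \in R$, and let $dist_i$ be the distance
function of $R_i$. The local $\varepsilon_i$-neighborhood w.r.t. $R_i$ of $o$, denoted by
$\mathcal{N}_{\varepsilon_i}^{R_i}(o)$, is defined by

$$\mathcal{N}_{\varepsilon_i}^{R_i}(o) = \{x \in DB \mid dist_i(R_i(o), R_i(x)) \leq \varepsilon_i\}.$$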
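As a minimal sketch of Definition 1, a local neighborhood can be computed by a linear scan. The storage convention (each object is a tuple with one entry per representation, `None` marking a missing representation) and the name `local_neighborhood` are assumptions made for illustration only; in practice an index structure for the representation would replace the scan.

```python
def local_neighborhood(db, o, i, dist_i, eps_i):
    """Return the indices x in db with dist_i(R_i(o), R_i(x)) <= eps_i.

    db     -- list of objects; db[x][i] is R_i(x), or None if missing (assumed convention)
    o      -- index of the query object in db
    i      -- index of the representation R_i
    dist_i -- distance function of representation R_i
    eps_i  -- radius chosen for representation R_i
    """
    if db[o][i] is None:
        # The object has no i-th representation, so it has no local neighborhood.
        return set()
    return {
        x
        for x, other in enumerate(db)
        if other[i] is not None and dist_i(db[o][i], other[i]) <= eps_i
    }


def abs_diff(a, b):
    """Toy distance function used for both representations below."""
    return abs(a - b)


# Toy database: two numeric representations, the second one missing for object 1.
db = [(0.0, 10.0), (0.5, None), (2.0, 10.5)]
print(local_neighborhood(db, 0, 0, abs_diff, 1.0))  # neighbors of object 0 in R_0
print(local_neighborhood(db, 1, 1, abs_diff, 1.0))  # empty set: R_1 is missing
```

Because each representation is queried with its own distance function and radius, nothing in this sketch requires a combined feature space.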

Note that $\varepsilon_i$ can be chosen optimally for each representation. The simplest
way of clustering multi-represented objects is to select one representation $R_i$ and
cluster all objects according to this representation. However, this approach restricts
data analysis to a limited part of the available information and does not use the
remaining representations to find a meaningful clustering. Another way to handle
multi-represented objects is to combine the different representations and use a
combined distance function. Then any established clustering algorithm can be
applied. However, it is very difficult to construct a suitable combined distance
function that is able to fairly weight each representation and handle missing
values. Furthermore, a combined feature space does not profit from specialized
data access structures for each representation.

The idea of our approach is to combine the information of all different repre- 
sentations as early as possible, i.e. during the run of the clustering algorithm, and 
as late as necessary, i.e. after using the different distance functions of each rep- 
resentation. To do so, we adapt the core object property proposed for DBSCAN. 
To decide whether an object is a core object, we use the local $\varepsilon$-neighborhoods
of each representation and combine the results to a global neighborhood. Therefore,
we must adapt the predicate direct density-reachability proposed for DBSCAN.
In the next two subsections, we will show how we can use the concepts of union
and intersection of local neighborhoods to handle multi-represented objects.

Fig. 1. The left figure displays local clusters and a noise object that are aggregated to
a multi-represented cluster $C$. The right figure illustrates how the intersection-method
divides a local clustering into clusters $C_1$ and $C_2$.



3.1 Union of Different Representations 

This variant is especially useful for sparse data. In this setting, the clusterings in
each single representation will provide several small clusters and a large amount
of noise. Simply enlarging $\varepsilon$ would relieve the problem, but on the other hand,
the separation of the clusters would suffer. The union-method assigns objects to the
same cluster if they are similar in at least one of the representations. Thus, it
preserves the separation of local clusters, but still overcomes the sparsity. If an
object is placed in a dense area of at least one representation, it is still a core
object regardless of how many other representations are missing. Thus, we do
not need to define dummy values. The left part of figure 1 illustrates the basic
idea. We adapt some of the definitions of DBSCAN to capture our new notion
of clusters. To decide whether an object $o$ is a union core object, we unite all
local $\varepsilon_i$-neighborhoods and check whether there are enough objects in the
global neighborhood, i.e. whether the global neighborhood of $o$ is dense.

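Definition 2 (union core object).

Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}^+$, $k \in \mathbb{N}$. An object
$o \in DB$ is called a union core object, denoted by
$CoreU^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(o)$, if the union of all local
$\varepsilon$-neighborhoods contains at least $k$ objects, formally:

$$CoreU^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(o) \Leftrightarrow \Big| \bigcup_{R_i(o) \in o} \mathcal{N}_{\varepsilon_i}^{R_i}(o) \Big| \geq k.$$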



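Definition 3 (direct union-reachability).

Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}^+$, $k \in \mathbb{N}$. An object
$p \in DB$ is directly union-reachable from $q \in DB$ if $q$ is a union core object
and $p$ is an element of at least one local $\mathcal{N}_{\varepsilon_i}^{R_i}(q)$, formally:

$$DirReachU^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(q, p) \Leftrightarrow CoreU^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(q) \wedge \exists\, i \in \{1, \ldots, m\} : R_i(p) \in \mathcal{N}_{\varepsilon_i}^{R_i}(q).$$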
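The union core object property and direct union-reachability can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the tuple-with-`None` storage convention, the linear scans, and all function names are assumptions.

```python
def union_neighborhood(db, o, dists, eps):
    """Union of the local eps_i-neighborhoods of object o over all
    representations that exist for o (missing ones are stored as None)."""
    result = set()
    for i, (dist_i, eps_i) in enumerate(zip(dists, eps)):
        if db[o][i] is None:
            continue  # a missing representation simply contributes nothing
        for x, other in enumerate(db):
            if other[i] is not None and dist_i(db[o][i], other[i]) <= eps_i:
                result.add(x)
    return result


def is_union_core(db, o, dists, eps, k):
    """o is a union core object if the united neighborhood holds at least k objects."""
    return len(union_neighborhood(db, o, dists, eps)) >= k


def directly_union_reachable(db, q, p, dists, eps, k):
    """p is directly union-reachable from q if q is a union core object and
    p lies in at least one local neighborhood of q."""
    return is_union_core(db, q, dists, eps, k) and p in union_neighborhood(db, q, dists, eps)


def abs_diff(a, b):
    return abs(a - b)


# Toy data: object 0 lacks the second representation but is still a core object.
db = [(0.0, None), (0.4, 10.0), (5.0, 10.5), (9.0, 50.0)]
dists, eps, k = [abs_diff, abs_diff], [0.5, 1.0], 2
print(is_union_core(db, 0, dists, eps, k))
print(directly_union_reachable(db, 1, 2, dists, eps, k))
```

Object 2 is far from object 1 in the first representation but close in the second, which is enough for the union-method to connect them.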

The predicate direct union-reachability is obviously symmetric for pairs of
core objects, because the $dist_i$ are symmetric distance functions. Thus, analogously
to DBSCAN, reachability and connectivity can be defined.



3.2 Intersection of Different Representations

The intersection-method is well suited for data containing unreliable
representations, i.e. there is a representation, but it is questionable whether it is
a good description of the object. In those cases, the intersection-method requires
that a cluster should contain only objects which are similar according to all
representations. Thus, this method is useful if all different representations exist,
but the derived distances do not adequately mirror the intuitive notion of similarity.
The intersection-method is used to increase the cluster quality by finding purer
clusters.

To decide whether an object $o$ is an intersection core object, we examine
whether $o$ is a core object in each involved representation. Of course, we use
different $\varepsilon$-values for each representation to decide whether locally there are
enough objects in the $\varepsilon$-neighborhood. The parameter $k$ is used to decide whether
globally there are still enough objects in the $\varepsilon$-neighborhood, i.e. the
intersection of all local neighborhoods contains at least $k$ objects.

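Definition 4 (intersection core object).

Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}^+$, $k \in \mathbb{N}$. An object
$o \in DB$ is called an intersection core object, denoted by
$CoreIS^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(o)$, if the intersection of all its local
$\varepsilon_i$-neighborhoods contains at least $k$ objects, formally:

$$CoreIS^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(o) \Leftrightarrow \Big| \bigcap_{i=1,\ldots,m} \mathcal{N}_{\varepsilon_i}^{R_i}(o) \Big| \geq k.$$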

Using this new property, we can now define direct intersection-reachability 
in the following way: 

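Definition 5 (direct intersection-reachability).

Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}^+$, $k \in \mathbb{N}$. An object
$p \in DB$ is directly intersection-reachable from $q \in DB$ if $q$ is an intersection
core object and $p$ is an element of all local $\mathcal{N}_{\varepsilon_i}^{R_i}(q)$, formally:

$$DirReachIS^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(q, p) \Leftrightarrow CoreIS^{k}_{\varepsilon_1, \ldots, \varepsilon_m}(q) \wedge \forall\, i = 1, \ldots, m : R_i(p) \in \mathcal{N}_{\varepsilon_i}^{R_i}(q).$$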

Again, reachability and connectivity can be defined analogously to DBSCAN. 
The right part of figure 1 illustrates the effects of this method. 
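The intersection variant can be sketched in the same style. Again this is only an illustration under stated assumptions: objects are tuples with one entry per representation, a missing representation (`None`) is treated as yielding an empty local neighborhood, and the function names are hypothetical.

```python
def intersection_neighborhood(db, o, dists, eps):
    """Intersection of the local eps_i-neighborhoods of object o."""
    result = set(range(len(db)))
    for i, (dist_i, eps_i) in enumerate(zip(dists, eps)):
        if db[o][i] is None:
            return set()  # assumption: no representation means no local neighbors
        local = {
            x
            for x, other in enumerate(db)
            if other[i] is not None and dist_i(db[o][i], other[i]) <= eps_i
        }
        result &= local
    return result


def is_intersection_core(db, o, dists, eps, k):
    """o is an intersection core object if at least k objects are neighbors
    of o in every representation simultaneously."""
    return len(intersection_neighborhood(db, o, dists, eps)) >= k


def directly_intersection_reachable(db, q, p, dists, eps, k):
    """p must lie in all local neighborhoods of the intersection core object q."""
    return (is_intersection_core(db, q, dists, eps, k)
            and p in intersection_neighborhood(db, q, dists, eps))


def abs_diff(a, b):
    return abs(a - b)


# Object 2 is close to object 0 in the first representation only,
# object 3 in the second representation only.
db = [(0.0, 10.0), (0.3, 10.2), (0.4, 30.0), (8.0, 10.1)]
dists, eps, k = [abs_diff, abs_diff], [0.5, 0.5], 2
print(intersection_neighborhood(db, 0, dists, eps))
print(directly_intersection_reachable(db, 0, 2, dists, eps, k))
```

Only object 1 agrees with object 0 in both representations, so the intersection-method keeps the resulting cluster purer than either single representation would.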



3.3 Determination of Density Parameters 

In [3], a heuristic is presented to determine the $\varepsilon$-value of the “thinnest”
cluster in the database. This heuristic is based on a diagram that represents sorted
$k$nn-distances of all given objects. In the case of multi-represented objects, we
have to choose $\varepsilon$ for each representation separately, whereas $k$ can be chosen
globally. A user determines a value for global $k$. The system computes the
$k$nn-distance diagrams for the given global $k$ (one diagram for every
representation). The user has to choose a so-called border object $o$ for each
representation. The $\varepsilon$ for the $i$-th representation is given by the
$k$nn-distance of the border object of $R_i$. Let us note that this method still
allows a certain range of $\varepsilon$-values to be chosen. The selection should mirror
the different requirements of the proposed methods. For the union-method, it is more
advisable to choose a lower or conservative value, since its characteristic demands
that the elements of the local $\varepsilon$-neighborhood should really be similar. For the
intersection-method, the $\varepsilon$-value should be selected progressively, i.e. at the
upper rim of the range. This selection reflects that the objects of a cluster need
not be too similar for a single representation, because it is required that they are
similar with respect to all representations.
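The per-representation diagrams described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: an object does not count as its own neighbor, the distances are sorted in descending order, objects are tuples with `None` for missing representations, and the function names are hypothetical.

```python
def knn_distance_diagram(values, dist, k):
    """Distances from each value to its k-th nearest neighbor, sorted descending."""
    diagram = []
    for a_idx, a in enumerate(values):
        others = sorted(dist(a, b) for b_idx, b in enumerate(values) if b_idx != a_idx)
        if len(others) >= k:
            diagram.append(others[k - 1])
    return sorted(diagram, reverse=True)


def diagrams_per_representation(db, dists, k):
    """One k-nn distance diagram per representation, all using the same global k."""
    diagrams = []
    for i, dist_i in enumerate(dists):
        present = [obj[i] for obj in db if obj[i] is not None]
        diagrams.append(knn_distance_diagram(present, dist_i, k))
    return diagrams


def abs_diff(a, b):
    return abs(a - b)


db = [(0.0, 5.0), (1.0, None), (2.0, 6.0), (10.0, 9.0)]
for i, diagram in enumerate(diagrams_per_representation(db, [abs_diff, abs_diff], k=1)):
    print(f"representation {i}: {diagram}")
```

The user would then pick a border object in each diagram; its $k$nn-distance becomes the $\varepsilon$ of that representation, chosen lower for the union-method and higher for the intersection-method.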

4 Performance Evaluation 

To demonstrate the capability of our method, we performed a thorough experimental
evaluation for two types of applications. We implemented the proposed
clustering algorithm in Java 1.4. All experiments were processed on a workstation
with a 2.6 GHz Pentium IV processor and 2 GB main memory.

4.1 Deriving Meaningful Groupings in Protein Databases 

The first set of experiments was performed on protein data that is represented
by amino-acid sequences and text descriptions. For this purpose, we employed entries
of the Swissprot protein database [2] belonging to 5 functional groups (cf. Table
1) and transformed each protein into a pair of feature vectors. Each amino-acid
sequence was mapped into a 436-dimensional feature space. The first 400 features
are 2-grams of successive amino acids. The last 36 dimensions are 2-grams of 6
exchange groups that the single amino acids belong to [8]. To compare the derived
feature vectors, we employed Euclidean distance. To process text documents, we
rely on projecting the documents into the feature space of relevant terms.
Documents are described by a vector of term frequencies weighted by the inverse
document frequency (TFIDF) [9]. We chose 100 words of medium frequency as
relevant terms and employed cosine distance to compare the TFIDF-vectors.
Since Swissprot entries provide a unique mapping to the classes of Gene
Ontology [10], a reference clustering for the selected proteins was available. Thus,
we are able to measure a clustering of Swissprot entries by the degree to which it
reproduces the class structure provided by Gene Ontology.
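The 436-dimensional sequence representation can be illustrated with a short sketch. Two points are assumptions and not taken from the paper: the features are raw 2-gram counts, and the six exchange groups listed in the code are one commonly cited grouping used here only as a placeholder (the paper takes its groups from [8]).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Placeholder exchange groups for illustration only; the paper's groups come from [8].
EXCHANGE_GROUPS = ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"]
GROUP_OF = {aa: g for g, letters in enumerate(EXCHANGE_GROUPS) for aa in letters}

# Position of every amino-acid 2-gram in the first 400 dimensions.
PAIR_INDEX = {a + b: n for n, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}


def two_gram_features(sequence):
    """Map a sequence to 400 amino-acid 2-gram counts plus 36 group 2-gram counts."""
    features = [0] * 436
    for a, b in zip(sequence, sequence[1:]):
        if a not in GROUP_OF or b not in GROUP_OF:
            continue  # skip non-standard residue symbols
        features[PAIR_INDEX[a + b]] += 1
        features[400 + 6 * GROUP_OF[a] + GROUP_OF[b]] += 1
    return features


vector = two_gram_features("ACCA")
print(len(vector), sum(vector[:400]), sum(vector[400:]))
```

A sequence of length n contributes n - 1 counts to each of the two blocks, so the two parts of the vector describe the same 2-grams at different levels of abstraction.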

To have an exact measure for this degree, we employed the class entropy in
each cluster. However, there are two effects that have to be considered to obtain a
fair measure of a clustering with noise. First, a large cluster of a certain entropy
should contribute more to the overall quality of the clustering than a rather
small cluster providing the same quality. The second effect is that a clustering
having a 5% noise ratio should be ranked higher than a clustering that has the
same average entropy for all its clusters but contains 50% noise.

Fig. 2. Clustering quality and noise ratio.

To consider both effects we propose the following quality measure for com- 
paring different clusterings with respect to a reference clustering. 

Definition 6. Let $O$ be the set of data objects, let $C = \{C_i \mid C_i \subset O\}$ be the set
of clusters and let $K = \{K_i \mid K_i \subset O\}$ be the reference clustering of $O$. Then we
define:

$$Q_K(C) = \sum_{C_i \in C} \frac{|C_i|}{|O|} \cdot (1 + entropy_K(C_i))$$

where $entropy_K(C_i)$ denotes the entropy of cluster $C_i$ with respect to $K$.
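A minimal sketch of this quality measure follows. The text does not spell out the sign or normalization of the entropy term; the sketch assumes that it is the sum of p log p over the reference classes, divided by the logarithm of the number of classes, so that it lies in [-1, 0] and the factor (1 + entropy) lies in [0, 1]. The function names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(cluster, label_of, n_classes):
    """Assumed convention: sum of p*log(p) over reference classes, scaled to [-1, 0]."""
    counts = Counter(label_of[o] for o in cluster)
    total = len(cluster)
    h = sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_classes) if n_classes > 1 else 0.0


def quality(clusters, label_of):
    """Weight each cluster by its share of all objects; noise objects belong to
    no cluster and therefore contribute nothing."""
    n_objects = len(label_of)
    n_classes = len(set(label_of.values()))
    return sum(
        len(c) / n_objects * (1 + normalized_entropy(c, label_of, n_classes))
        for c in clusters
    )


labels = {0: "a", 1: "a", 2: "b", 3: "b"}   # reference clustering K
print(quality([{0, 1}, {2, 3}], labels))    # reproduces K exactly
print(quality([{0, 1}], labels))            # pure cluster, half of the data is noise
print(quality([], labels))                  # no clusters found at all
```

Under this convention a pure cluster contributes its full size share, a maximally mixed cluster contributes nothing, and noise lowers the score in proportion to the number of unclustered objects.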

The idea is to weight every cluster by the percentage of the complete data
objects being part of it. Thus, smaller clusters are less important than larger
ones, and a clustering providing an extraordinary amount of noise can contribute
only the percentage of clustered objects to the quality. Let us note that we add
1 to the cluster entropies. Therefore, we measure the reference clustering $K$ with
a quality score of 1 and a worst-case clustering (e.g. no clusters are found
at all) with a score of 0. To relate the quality of the clustering achieved by
our methods to the results of former methods, we compared it to 4 alternative
approaches. First, we clustered text and sequences separately using only one
of the representations. A second approach combines the features of both
representations into a common feature space and employs the cosine distance to
relate the resulting feature vectors. As the only other clustering method that is
able to handle multi-represented data, we additionally compared reinforcement
clustering using DBSCAN as the underlying cluster algorithm. For reinforcement
clustering, we ran 10 iterations and tried several values of the weighting
parameter $\alpha$. The local $\varepsilon$-parameters were selected as described above and we
chose $k = 2$. To consider the different requirements of both methods, for each data
set a progressive and a conservative $\varepsilon$-value was determined. All approaches were
run for both settings and the best results are displayed.

Fig. 3. Example of an image cluster. The left rectangle contains images clustered by the
intersection-method. The right rectangles display additional images that were grouped
with the corresponding cluster when clustering the images with respect to a single
representation.

The left diagram of figure 2 displays the derived quality for those 4 methods
and the two variants of our method. In all five test sets, the union-method using
conservative $\varepsilon$-values outperformed all of the other algorithms. Furthermore,
the noise ratio for each data set was between 16% and 28% (cf. figure 2, right),
indicating that the main portion of the data objects belongs to some cluster.
The intersection-method using progressive $\varepsilon$-parameters performed comparably
well, but was too restrictive to overcome the sparseness of the data as well as
the union-method.



4.2 Clustering Images by Multiple Representations 

Clustering image data is a good example for the usefulness of the
intersection-method. A lot of different similarity models exist for image data, each
having its own advantages and disadvantages. Using, for example, text descriptions
of images, one is able to cluster all images related to a certain topic, but these
images need not look alike. Using color histograms instead, the images are
clustered according to the distribution of color in the image. But as only the color
information is taken into account, a green meadow with some flowers and a green
billiard table with some colored balls on it can of course not be distinguished
by this similarity model. On the other hand, a similarity model taking content
information into account might not be able to distinguish images of different
colors.

Our intersection approach is able to get the best out of all these different
types of representations. Since the similarity in one representation is not really
sound, the intersection-method is well suited to find clusters of better quality
for this application. For our experiments, we used two different representations.
The first representation was a 64-dimensional color histogram. In this case, we
used the weighted distance between those color histograms, represented as a
quadratic form distance function as described for example in [11]. The second
representation consisted of segmentation trees. An image was first divided into
segments of similar color by a segmentation algorithm. In a second step, a tree was
created from those segments by iteratively applying a region-growing algorithm
which merges neighboring segments if their colors are alike. In [12], an efficient
technique is described to compute the similarity between two such trees using
filters for the complex edit-distance measure.

As we do not have any class labels to measure the quality of our cluster- 
ing, we can only describe the results we achieved. In general, the clusters we 
got using both representations were more accurate than the clusters we got us- 
ing each representation separately. Of course, the noise ratio increased for the 
intersection-method. Due to space limitations we only show one sample clus- 
ter of images we found with the intersection-method (see Figure 3). Using this 
method, very similar images are clustered together. When clustering each single 
representation, a lot of additional images were added to the corresponding clus- 
ter. As one can see, using the intersection-method only the most similar images 
of both representations still belong to the cluster. 

5 Conclusions 

In this paper, we discussed the problem of clustering multi-represented objects. 
A multi-represented object is described by a set of representations where each 
representation belongs to a different data space. Contrary to existing approaches 
our proposed method is able to cluster this kind of data using all available repre- 
sentations without forcing the user to construct a combined data space. The idea 
of our approach is to combine the information of all different representations as 
early as possible and as late as necessary. To do so, we adapted the core object 
property proposed for DBSCAN. To decide whether an object is a core object,
we use the local $\varepsilon$-neighborhoods of each representation and combine the results
to a global neighborhood. Based on this idea, we proposed two different methods
for varying applications. For sparse data, we introduced the union-method, which
assumes that an object is a core object if $k$ objects are found within the union of
its local $\varepsilon$-neighborhoods. Correspondingly, we defined the intersection-method
for data where each local representation yields rather big and unspecific clusters.
The intersection-method requires that at least $k$ objects are within
the intersection of all local $\varepsilon$-neighborhoods of a core object. In our
experimental evaluation, we introduced an entropy-based quality measure that compares
a given clustering with noise to a reference clustering. Employing this quality
measure, we demonstrated that the union-method was most suitable to overcome
the sparsity of a given protein data set. To demonstrate the ability of the
intersection-method to increase the cluster quality, we applied it to a set of
images using two different similarity models. For future work, we plan to examine
applications providing more than two representations. We are especially
interested in clustering proteins with respect to all of the mentioned representations.
Another interesting challenge is to extend our method to a multi-instance and
multi-representation clustering. In this setting, each object may be represented
by several instances in some of the representations.



Abstract. Given a point query Q in multi-dimensional space, K-Nearest
Neighbor (KNN) queries return the K closest answers in the database with 
respect to Q. In this scenario, it is possible that a majority of the answers may 
be very similar to one or more of the other answers, especially when the data 
has clusters. For a variety of applications, such homogeneous result sets may 
not add value to the user. In this paper, we consider the problem of providing 
diversity in the results of KNN queries, that is, to produce the closest result set 
such that each answer is sufficiently different from the rest. We first propose 
a user-tunable definition of diversity, and then present an algorithm, called 
MOTLEY, for producing a diverse result set as per this definition. Through a 
detailed experimental evaluation we show that MOTLEY can produce diverse 
result sets by reading only a small fraction of the tuples in the database. Further, 
it imposes no additional overhead on the evaluation of traditional KNN queries, 
thereby providing a seamless interface between diversity and distance. 

Keywords: Nearest Neighbor, Distance Browsing, Result Diversity 



1 Introduction 

Over the last few years, there has been considerable interest in the database community 
with regard to supporting K-Nearest Neighbor (KNN) queries [8]. The general model 
of a KNN query is that the user gives a point query in multidimensional space and a 
distance metric for measuring distances between points in this space. The system is then 
expected to find, with regard to this metric, the K closest answers in the database from the 
query point. Typical distance metrics include Euclidean distance, Manhattan distance, 
etc. 

It is possible that a majority of the answers to a KNN query may be very similar to 
one or more of the other answers, especially when the data has clusters. In fact, there may 
even be duplicates w.r.t. the attributes of the multidimensional space. For a variety of 
applications, such as online restaurant selection [6], providing homogeneous result sets 
may not add value to the user. It is our contention in this paper that, for such applications, 
the user would like to have not just the closest set of answers, but the closest diverse set 
of answers. 

Based on the above motivation, we consider here the problem of providing diversity
in the results of KNN queries, that is, to produce the closest result set such that each
answer is sufficiently diverse from the rest. We hereafter refer to this problem as the
K-Nearest Diverse Neighbor (KNDN) problem, which to the best of our knowledge has
not been previously investigated in the literature, and cannot be handled by traditional
clustering techniques (details in [6]).

* Contact Author: haritsa@dsl.serc.iisc.emet.in

An immediate question that arises is how to define diversity. This is obviously a 
user-dependent choice, so we address the issue by providing a tunable definition, which 
can be set with a single parameter, MinDiv, by the user. MinDiv values range over [0,1] 
and specify the minimum diversity that should exist between any pair of answers in the 
result set. Setting MinDiv to zero results in the traditional KNN query, whereas higher 
values give more and more importance to diversity at the expense of distance. 

Finding the optimal result set for KNDN queries is an NP-complete problem in
general, and is computationally extremely expensive even for fixed K, making it
infeasible in practice. Therefore, we present here an online algorithm, called
MOTLEY¹, for producing a sufficiently diverse and close result set. MOTLEY adopts a
greedy heuristic and
leverages the existence of a spatial-containment-based multidimensional index, such as 
the R-tree, which is natively available in today’s commercial database systems [7]. The 
R-tree index supports a “distance browsing” mechanism proposed in [5] through which 
database points can be efficiently accessed in increasing order of their distance from the 
query point. A pruning technique is incorporated in MOTLEY to minimize the R-tree 
processing and the number of database tuples that are examined. 
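To make the greedy idea concrete, the following is a simplified sketch and not the MOTLEY algorithm itself: it replaces R-tree distance browsing with a plain sort, omits the pruning technique, and takes the diversity test as a caller-supplied predicate. The names `greedy_diverse_knn`, `spatial_dist` and `is_diverse` are hypothetical.

```python
def greedy_diverse_knn(points, query, k, spatial_dist, is_diverse):
    """Visit points in increasing distance from the query and keep a point
    only if it is diverse with respect to every point already selected."""
    result = []
    # Sorting stands in for the index-based distance browsing used in the paper.
    for p in sorted(points, key=lambda p: spatial_dist(p, query)):
        if all(is_diverse(p, r) for r in result):
            result.append(p)
            if len(result) == k:
                break
    return result


def abs_diff(a, b):
    return abs(a - b)


points = [0.1, 0.12, 0.5, 0.9]

# With a diversity threshold, the near-duplicate 0.12 is skipped.
print(greedy_diverse_knn(points, 0.0, 2, abs_diff, lambda a, b: abs(a - b) >= 0.2))

# With a threshold of zero every pair is diverse, giving a plain KNN result.
print(greedy_diverse_knn(points, 0.0, 2, abs_diff, lambda a, b: abs(a - b) >= 0.0))
```

The second call shows why a zero threshold adds no extra work over a traditional KNN query: the first K points in distance order are accepted unconditionally.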

Through a detailed experimental evaluation on real and synthetic data, we have found 
that MOTLEY can produce a diverse result set by reading only a small fraction of the 
tuples in the database. Further, the quality of its result set is very close to that provided 
by an off-line brute-force optimal algorithm. Finally, it can also evaluate traditional 
KNN queries without any added cost, thereby providing a seamless interface between 
the orthogonal concepts of diversity and distance. 



2 Basic Concepts and Problem Formulation 

In the following discussion, for ease of exposition and due to space limitations, we focus 
on a restricted instance of the KNDN problem - a significantly more general formulation 
is available in the full version of the paper [6]. 

We model the database as composed of N tuples over a D-dimensional space with
each tuple representing a point in this space². The domains of all attributes are numeric
and normalized to the range [0,1]. The user specifies a point query Q over an M-sized
subset of these attributes ($M \leq D$). We refer to these attributes as “point attributes”. The
user also specifies K, the number of desired answers, and an L-sized subset of attributes
on which she would like to have diversity ($L \leq D$). We refer to these attributes as
“diversity attributes” and the space formed by diversity attributes as diversity-space.
Note that the choice of the diversity attributes is orthogonal to the choice of the point
attributes. Finally, the result is a set of K database points.

¹ Motley: A collection containing a variety of sorts of things [9].

² We use point and tuple interchangeably in the remainder of this paper.

Given that there are N points in the database and that we need to select K points for
the result set, there are ${}^{N}C_{K}$ possible choices. We apply the diversity constraints first
to determine the feasible sets and then bring in the notion of distance from the query
point to make a selection from these sets. Viewed abstractly, we have a two-level scoring
function: the first level chooses candidate result sets based on diversity constraints, and
the second level selects the result set that is spatially closest to the query point.
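The two-level selection can be illustrated by a brute-force sketch that enumerates all K-subsets. It is meant only to make the formulation concrete for tiny inputs; the function name, the caller-supplied `is_diverse` predicate, and the default use of the arithmetic mean as the set distance are assumptions of this sketch.

```python
from itertools import combinations
from statistics import mean


def best_diverse_set(points, query, k, spatial_dist, is_diverse, agg=mean):
    """Exhaustively search all K-subsets for the closest fully diverse one."""
    best, best_distance = None, None
    for candidate in combinations(points, k):
        # Level 1: keep only sets whose points are pairwise diverse.
        if not all(is_diverse(a, b) for a, b in combinations(candidate, 2)):
            continue
        # Level 2: among feasible sets, prefer the smallest aggregate distance.
        distance = agg([spatial_dist(p, query) for p in candidate])
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    return best


def abs_diff(a, b):
    return abs(a - b)


points = [0.1, 0.12, 0.5, 0.9]
print(best_diverse_set(points, 0.0, 2, abs_diff, lambda a, b: abs(a - b) >= 0.2))
```

The number of candidate sets grows combinatorially with N and K, which is why such an exhaustive search is only usable as an offline baseline.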



Result Diversity. We begin by defining point diversity and then, since the result is
viewed as a set, extend the definition to set diversity. Point diversity is defined with
regard to a pair of points and is evaluated with respect to the diversity attributes, $V(Q)$,
mentioned in the query. Specifically, given points $P_1$, $P_2$ and $V(Q)$, the function
$DIV(P_1, P_2, V(Q))$ returns true if $P_1$ and $P_2$ are diverse with respect to each other on the
specified diversity attributes. A sample DIV function is described later in this section.

For a set to be fully diverse, all the points in the set should be mutually diverse. That
is, given a result set $\mathcal{R}$ with points $R_1, R_2, \ldots, R_K$, we require
$DIV(R_i, R_j, V(Q)) = true$ $\forall\, i, j$ such that $i \neq j$ and $1 \leq i, j \leq K$. For the
restricted scenario considered here, we assume that at least one fully diverse result set
is always available for the user query.



Diversity Function. Our computation of the diversity between two points $P_1$ and $P_2$ is
based on the classical Gower coefficient [2], wherein the difference between two points
is defined as a weighted average of the respective attribute differences. Specifically,
we first compute the differences between the attribute values of these two points in
diversity-space, sequence these differences in decreasing order of their values, and then
label them as $(\delta_1, \delta_2, \ldots, \delta_L)$. Now, we calculate divdist, the diversity distance
between points $P_1$ and $P_2$ with respect to diversity attributes $V(Q)$, as

$$divdist(P_1, P_2, V(Q)) = \sum_{j=1}^{L} (W_j \times \delta_j) \qquad (1)$$

where the $W_j$'s are weighting factors for the differences. Since all $\delta_j$'s are in the range
[0,1] (recall that the values on all dimensions are normalized to [0,1]), and by virtue of
the $W_j$ assignment policy discussed below, diversity distances are also bounded in the
range [0,1].

The assignment of the weights is based on the heuristic that larger weights should
be assigned to the larger differences. That is, in Equation 1, we need to ensure that
$W_i \geq W_j$ if $i < j$ (recall that the $\delta_j$'s are sorted in decreasing order). The rationale
for this assignment is as follows: Consider the case where point $P_1$ has values
(0.2, 0.2, 0.3), point $P_2$ has values (0.19, 0.19, 0.29) and point $P_3$ has values
(0.2, 0.2, 0.27). Consider the diversity of $P_1$ with respect to $P_2$ and $P_3$. While the
aggregate difference is the same in both cases, intuitively we can see that the pair
$(P_1, P_2)$ is more homogeneous than the pair $(P_1, P_3)$. This is because $P_1$ and $P_3$ differ
considerably on the third attribute as compared to the corresponding differences between
$P_1$ and $P_2$.

Now consider the case where $P_3$ has values (0.2, 0.2, 0.28). Here, although the
aggregate $\delta_j$ is higher for the pair $(P_1, P_2)$, it is again the pair $(P_1, P_3)$ that appears
more diverse, since its difference on the third attribute is larger than any of the
individual differences in pair $(P_1, P_2)$.

Based on the above discussion, the weighting function should have the following
properties: First, all weights should be positive, since having a difference in any
dimension should never decrease the diversity. Second, the sum of the weights should add
up to 1 (i.e., $\sum_{j=1}^{L} W_j = 1$) to ensure that divdist values are normalized to the [0,1]
range. Finally, the weights should be monotonically decaying ($W_i \geq W_j$ if $i < j$) to
reflect the preference given to larger differences.



Example 1. A candidate weighting function that obeys the above requirements is the
following:

$$W_j = \frac{a^{j-1} \, (1 - a)}{1 - a^{L}} \qquad (1 \leq j \leq L) \qquad (2)$$

where $a$ is a tunable parameter over the range (0,1). Note that this function implements
a geometric decay, with the parameter $a$ determining the rate of decay. Values of $a$
that are close to 0 result in faster decay, whereas values close to 1 result in slow decay.
When the value of $a$ is nearly 0, almost all weight is given to the maximum difference,
i.e., $W_1 \approx 1$, modeling (in the language of vector p-norms) the $L_\infty$ (i.e., Max) distance
metric, and when $a$ is nearly 1, all attributes are given similar weights, modeling an $L_1$
(i.e., Manhattan) distance metric.
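The diversity distance with geometrically decaying weights can be sketched as below. The closed form used for the weights, a^(j-1)(1-a)/(1-a^L), is an assumption reconstructed from the stated properties (positive, geometric decay governed by a, summing to 1); the function names are likewise illustrative.

```python
def geometric_weights(L, a):
    """Weights proportional to a**(j-1), normalized so that they sum to 1 (0 < a < 1)."""
    norm = (1 - a) / (1 - a ** L)
    return [a ** (j - 1) * norm for j in range(1, L + 1)]


def divdist(p1, p2, diversity_attrs, a):
    """Weighted average of attribute differences, largest difference weighted most."""
    deltas = sorted((abs(p1[d] - p2[d]) for d in diversity_attrs), reverse=True)
    weights = geometric_weights(len(deltas), a)
    return sum(w * delta for w, delta in zip(weights, deltas))


# The three points from the rationale above, diverse on all three attributes.
P1, P2, P3 = (0.2, 0.2, 0.3), (0.19, 0.19, 0.29), (0.2, 0.2, 0.27)
attrs = [0, 1, 2]
print(divdist(P1, P2, attrs, a=0.1))  # three small differences of 0.01 each
print(divdist(P1, P3, attrs, a=0.1))  # one larger difference of 0.03
```

Although both pairs have the same aggregate difference, the pair with the single larger difference receives the higher diversity distance, matching the intuition given in the text.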



Minimum Diversity Threshold. We expect that the user provides a quantitative notion
of the minimum diversity distance that she expects in the result set through a threshold
parameter MinDiv that ranges over [0,1]³. Given this threshold setting, two points
are diverse if the diversity distance between them is greater than or equal to MinDiv.
That is, $DIV(P_1, P_2, V(Q)) = true$ iff $divdist(P_1, P_2, V(Q)) \geq MinDiv$.

The physical interpretation of the MinDiv value is that if a pair of points are deemed
to be diverse, then these two points have a difference of MinDiv or more on at least one
diversity dimension. For example, a MinDiv of 0.1 means that any pair of diverse points
differ in at least one diversity dimension by at least 10% of the associated domain size.
This physical interpretation can guide the user in determining the appropriate setting of
MinDiv. In practice, we would expect that MinDiv settings would be on the low side,
typically not more than 0.2. As a final point, note that with the above formulation, the
DIV function is symmetric with respect to the point pair ($P_1$, $P_2$). However, it is not
transitive: even if $DIV(P_1, P_2, V(Q))$ and $DIV(P_2, P_3, V(Q))$ are both true, it
does not follow that $DIV(P_1, P_3, V(Q))$ is true.
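
The symmetry and non-transitivity of DIV can be illustrated with a small sketch. It assumes that divdist is the weighted sum of the per-dimension differences sorted in decreasing order, with geometrically decaying weights, and that all attributes are already normalised to [0,1]; the helper names and the three sample points are hypothetical.

```python
def geometric_weights(a, m):
    # Assumed normalised geometric weights (hypothetical helper).
    norm = (1.0 - a) / (1.0 - a ** m)
    return [norm * a ** j for j in range(m)]


def divdist(p, q, a=0.1):
    """Assumed diversity distance: largest differences get the largest weights."""
    diffs = sorted((abs(x - y) for x, y in zip(p, q)), reverse=True)
    weights = geometric_weights(a, len(diffs))
    return sum(w * d for w, d in zip(weights, diffs))


def is_diverse(p, q, min_div, a=0.1):
    """DIV predicate: true iff divdist(p, q) >= MinDiv."""
    return divdist(p, q, a) >= min_div


# Three points on two (normalised) diversity dimensions.
p1, p2, p3 = (0.0, 0.0), (0.5, 0.5), (0.05, 0.0)
```

Here p1 and p2 are diverse, p2 and p3 are diverse, yet p1 and p3 are not diverse at MinDiv = 0.1, which is the non-transitive case described above. Because the weights sum to 1, divdist never exceeds the largest single difference, which is why a diverse pair must differ by at least MinDiv on some dimension.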



Integrating Diversity and Distance. Let function SpatialDist(P, Q) calculate the
spatial distance of point P from query point Q (this distance is computed with regard to
the point attributes specified in Q). The choice of SpatialDist function is based on the
user specification and could be any monotonically increasing distance function such as
Euclidean, Manhattan, etc. We combine distances of all points in a set into a single value
using an aggregate function Agg which captures the overall distance of the set from Q.
While a variety of aggregate functions are possible, the choice is constrained by the fact
that the aggregate function should ensure that as the points in the set move farther away
from the query, the distance of the set should also increase correspondingly. Sample
aggregate functions which obey this constraint include the Arithmetic, Geometric, and
Harmonic Means.

^ This is similar to the user specifying minimum support and minimum confidence in association
rule mining to determine what constitutes interesting correlations.

Finally, we use the reciprocal of the aggregate of the spatial distances of the result
points from the query point to determine the score of the (fully diverse) result set. (Note
that the MinDiv threshold only determines the identities of the fully diverse result sets,
but not their scores.) Putting all these formulations together, given a query Q and a
candidate fully diverse result set $\mathcal{R}$ with points $R_1, R_2, \ldots, R_K$, the score of
$\mathcal{R}$ with respect to Q is computed as

$$Score(\mathcal{R}, Q) = \frac{1}{Agg(SpatialDist(Q, R_1), \ldots, SpatialDist(Q, R_K))} \qquad (3)$$
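
As an illustration of the scoring rule, the sketch below computes the reciprocal of an aggregate of the spatial distances. It assumes Euclidean distance and the Harmonic Mean as Agg; the function names `harmonic_mean` and `score` are introduced here for illustration only.

```python
import math


def harmonic_mean(values):
    # Undefined if any distance is zero, i.e. a result point coincides with Q.
    return len(values) / sum(1.0 / v for v in values)


def score(query, result_set, agg=harmonic_mean):
    """Reciprocal of the aggregated distances of the result points from the query."""
    distances = [math.dist(query, r) for r in result_set]
    return 1.0 / agg(distances)


q = (0.0, 0.0)
near = [(3.0, 4.0), (0.0, 5.0)]    # both points at distance 5 from q
far = [(6.0, 8.0), (0.0, 10.0)]    # both points at distance 10 from q
```

A set whose points lie farther from the query receives a lower score, which is the monotonicity constraint placed on Agg.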

Problem Formulation. In summary, our problem formulation is as follows: 

Given a point query Q on a D-dimensional database, a desired result cardinality 
of K, and a MinDiv threshold, the goal of the K-Nearest Diverse Neighbor (KNDN) 
problem is to find the set of K mutually diverse tuples in the database, whose score, as 
per Equation 3, is the maximum, after including the nearest tuple to Q in the result set. 

The nearest point to the user’s query is required to always form part of the result set
because this point, in a sense, best fits the user’s query. Further, the nearest
point $R_1$ serves to seed the result set since the diversity function is meaningful only for
a pair of points. Since point $R_1$ of the result is fixed, the result sets are differentiated
based on their remaining $K - 1$ choices.

An important point to note here is that when MinDiv is set to zero, all points (including 
duplicates) are diverse with respect to each other and hence the KNDN problem reduces 
to the traditional KNN problem. 

3 The MOTLEY Algorithm 

Finding the optimal result set for the KNDN problem is computationally hard. We 
can establish this (proof in [6]) by mapping KNDN to the well known independent 
set problem [3], which is NP-complete. Therefore, we present an alternative algorithm 
here called MOTLEY, which employs a greedy selection strategy in combination with 
a distance-browsing-based accessing of points; our experimental evaluation, presented 
later in Section 4, shows that the result sets obtained are extremely close to the optimal 
solution. 

3.1 Distance Browsing 

In order to process database tuples (i.e., points) incrementally, we adopt the “distance
browsing” approach proposed in [5], through which it is possible to efficiently access
data points in increasing order of their distance from the query point. This approach is
predicated on having a containment-based index structure such as the R-Tree [4], built
collectively on all dimensions of the database (more precisely, the index needs to cover
only those dimensions on which point predicates may appear in the query workload).

To implement distance browsing, a priority queue, pqueue, is maintained which is
initialized with the root node of the R-Tree. The pqueue maintains the R-Tree nodes and
data tuples in increasing order of their distance from the query point. While the distance
between a data point and the query Q is computed in the standard manner, the distance
between an R-tree node and Q is computed as the minimum of the distances between Q
and all points in the region enclosed by the MBR (Minimum Bounding Rectangle) of the
R-tree node. The distance of a node from Q is zero if Q is within the MBR of that node;
otherwise it is the distance of the closest point on the MBR periphery. For this, we first
need to compute the distances between the MBR and Q along each query dimension:
if Q is inside the MBR on a specific dimension, the distance is zero, whereas if Q is
outside the MBR on this dimension, it is the distance from Q to either the low end or
the high end of the MBR, whichever is nearer. Once the distances along all dimensions
are available, they are combined (based on the distance metric in operation) to get the
effective distance.

Example 2. Consider an MBR, M, specified by ((1,1,1), (3,3,3)) in a 3-D space. Let
$P_1(2,2,2)$ and $P_2(4,2,0)$ be two data points in this space. Then,
$SpatialDist(M, P_1) = \sqrt{0^2 + 0^2 + 0^2} = 0$ and
$SpatialDist(M, P_2) = \sqrt{(4-3)^2 + 0^2 + (0-1)^2} = 1.414$.
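
The per-dimension rule and the worked example can be reproduced with a short sketch. It assumes the Euclidean metric for combining the per-dimension gaps; `mbr_min_dist` is a hypothetical helper name.

```python
import math


def mbr_min_dist(low, high, q):
    """Minimum Euclidean distance from point q to the box [low, high].

    Per dimension the gap is zero when q lies inside the box's extent,
    otherwise the distance to the nearer of the low and high ends.
    """
    gaps = [max(lo - x, 0.0, x - hi) for lo, hi, x in zip(low, high, q)]
    return math.sqrt(sum(g * g for g in gaps))


inside = mbr_min_dist((1, 1, 1), (3, 3, 3), (2, 2, 2))    # point inside the MBR
outside = mbr_min_dist((1, 1, 1), (3, 3, 3), (4, 2, 0))   # gaps are (1, 0, 1)
```

For the second point the gaps are 1, 0 and 1, giving the square root of 2, which matches the value 1.414 in the example.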

To return the next nearest neighbor, we pick up the first element of the pqueue. If it
is a tuple, it is immediately returned as the next nearest neighbor. However, if the element is
an R-tree node, all the children of that node are inserted in the pqueue. Note that during
this insertion process, the spatial distance of the object from the query point is calculated
and used as the insertion key. The insertion process is repeated until we get a tuple as
the first element of the queue, which is then returned.

The above distance browsing process continues until either the diverse result set is 
found, or until all points in the database are exhausted, signaled by the pqueue becoming 
empty. 
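
The queue-driven traversal described above can be sketched as a generator. This is an illustration under simplifying assumptions rather than the implementation of [5]: the `Node` class is a hypothetical stand-in for an R-tree node (an MBR plus children that are either nodes or point tuples), and distances are Euclidean.

```python
import heapq
import itertools
import math


class Node:
    """Hypothetical R-tree node: an MBR (low, high) and child nodes or points."""

    def __init__(self, low, high, children):
        self.low, self.high, self.children = low, high, children


def min_dist(entry, q):
    if isinstance(entry, Node):
        gaps = [max(lo - x, 0.0, x - hi)
                for lo, hi, x in zip(entry.low, entry.high, q)]
        return math.sqrt(sum(g * g for g in gaps))
    return math.dist(entry, q)          # entry is a data point (tuple)


def browse(root, q):
    """Yield (distance, point) pairs in increasing order of distance from q."""
    tie = itertools.count()             # tie-breaker so entries are never compared
    pqueue = [(min_dist(root, q), next(tie), root)]
    while pqueue:
        dist, _, entry = heapq.heappop(pqueue)
        if isinstance(entry, Node):     # expand the node: enqueue all children
            for child in entry.children:
                heapq.heappush(pqueue, (min_dist(child, q), next(tie), child))
        else:                           # a tuple at the head is the next neighbor
            yield dist, entry


left = Node((0, 0), (2, 2), [(0, 0), (2, 2), (1, 0)])
right = Node((5, 5), (9, 9), [(5, 5), (9, 9)])
root = Node((0, 0), (9, 9), [left, right])
visited = list(browse(root, (1, 1)))
```

Because a node's key is a lower bound on the distance of every point beneath it, a point that reaches the head of the queue cannot be beaten by anything still enqueued, so the caller can stop consuming the generator as soon as enough results are found.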

3.2 Finding Diverse Results 

We first present a simple greedy approach, called Immediate Greedy, and then its exten- 
sion Buffered Greedy, for efficiently finding result sets that are both close and diverse. 

Immediate Greedy Approach. In the ImmediateGreedy (IG) method, tuples are sent in
increasing order of their spatial distance from the query point using distance browsing,
as discussed above. The first tuple is always inserted into the result set, $\mathcal{R}$, to satisfy
the requirement that the closest tuple to the query point must figure in the result set.
Subsequently, each new tuple is added to $\mathcal{R}$ if it is diverse with respect to all tuples
currently in $\mathcal{R}$; otherwise, it is discarded. This process continues until $\mathcal{R}$ grows to
contain K tuples. Note that the result set obtained by this approach has the following
property: Let $\mathcal{B} = b_1, \ldots, b_K$ be the sequence formed by any other fully diverse set such
that elements are listed in increasing order of spatial distance from Q. Now if i is the
smallest index such that $b_i \ne R_i$ ($R_i \in \mathcal{R}$), then
$SpatialDist(b_i, Q) \ge SpatialDist(R_i, Q)$.
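
The selection loop of Immediate Greedy is short enough to sketch directly. The sketch assumes the caller supplies points already ordered by distance from the query (for instance from distance browsing) and a pairwise diversity predicate; the one-dimensional data and the threshold predicate below are hypothetical.

```python
def immediate_greedy(points_by_distance, is_diverse, k):
    """Keep each arriving point that is diverse w.r.t. every point kept so far."""
    result = []
    for p in points_by_distance:
        # The first point always passes, because the result set is still empty.
        if all(is_diverse(p, r) for r in result):
            result.append(p)
            if len(result) == k:
                break
    return result   # may hold fewer than k points if the input is exhausted


# Hypothetical 1-D data, already sorted by distance from a query at 0.0.
stream = [0.0, 0.05, 0.12, 0.2, 0.21, 0.35]
chosen = immediate_greedy(stream, lambda p, r: abs(p - r) >= 0.1, 3)
```

On this input the points 0.05, 0.2 and 0.21 are discarded because each lies within 0.1 of an already selected point.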

While the IG approach is straightforward and easy to implement, there are cases
where it may make poor choices, as shown in Figure 1. Here, Q is the query point, and $P_1$
through $P_5$ are the tuples in the database. Let us assume that the goal is to report 3 diverse
tuples with a MinDiv of 0.1. Clearly, $\{P_1, P_3, P_4\}$ satisfies the diversity requirement.
Also, $DIV(P_1, P_2, V(Q))$ = true. But the inclusion of $P_2$ disqualifies the candidatures of
$P_3$ and $P_4$, as both $DIV(P_2, P_3, V(Q))$ = false and $DIV(P_2, P_4, V(Q))$ = false.
By inspection, we observe that the overall best choice could be $\{P_1, P_3, P_4\}$, but
Immediate Greedy would give the solution as $\{P_1, P_2, P_5\}$. Moreover, if point $P_5$ is
not present in the database, then this approach will fail to return a fully diverse set even
though such a set, namely $\{P_1, P_3, P_4\}$, is available.





Fig. 1. Poor Choice by Immediate Greedy

Fig. 2. Heuristic in Buffered Greedy Approach



Buffered Greedy Approach. The above problems arise because IG retains only the
diverse points (hereafter called “leaders”) in the result set at all times. The BufferedGreedy
(BG) method addresses them by additionally maintaining with each
leader a bounded buffered set of “dedicated followers”: a dedicated follower is a point
that is not diverse with respect to a specific leader but is diverse with respect to all
remaining leaders. Our empirical results show that a buffer of capacity K points (where K
is the desired result size) for each leader is sufficient to produce a near-optimal solution.
The additional memory requirement for the buffers is small for typical values of K and
D (e.g., for K=10 and D=10, and using 8 bytes to store each attribute value, the K leaders
with K followers each need only 10 x 10 x 10 x 8 = 8,000 bytes, i.e., about 8K bytes, of
additional storage).

Given this additional set of dedicated followers, we adopt the heuristic that a current
leader, $L_i$, is replaced in the result set by its dedicated followers
$F_i^1, F_i^2, \ldots, F_i^j$ ($j > 1$)
as leaders if (a) these dedicated followers are all mutually diverse, and (b) incorporation
of these followers as leaders does not result in the premature disqualification of future
leaders. The first condition is necessary to ensure that the result set contains only diverse
points, while the second is necessary to ensure that we do not produce solutions that are
worse than Immediate Greedy. For example, if in Figure 1, point $P_5$ had happened to
be only a little farther than point $P_4$ such that $DIV(P_2, P_5, V(Q))$ = true, then the
replacement of $P_2$ by $P_3$ and $P_4$ could be the wrong choice since $\{P_1, P_2, P_5\}$ may turn
out to be the best solution.

To implement the second condition, we need to know when it is “safe” to go ahead
with a replacement, i.e., when it is certain that all future leaders will be diverse from
the current set of followers. To achieve this, we take the following approach: For each
point, we consider a hypothetical sphere that contains all points in the domain space
that may be non-diverse with respect to it. That is, we set the radius of the sphere
to be equal to the distance of the farthest non-diverse point in the domain space. Note
that this sphere may contain some diverse points as well, but our objective is to take a
conservative approach. Now, the replacement of a leader by selected dedicated followers
can be done as soon as we have reached a distance greater than $R_m$ with respect to the
farthest follower from the query. This is because all future leaders will then be diverse with
respect to the selected dedicated followers and there is no possibility of disqualification
beyond this point. To clarify this technique, consider the following example:



Example 3. In Figure 2, the circles around $P_1$ and $P_2$ show the areas that contain all
points that are not diverse with respect to $P_1$ and $P_2$, respectively. Due to the distance
browsing technique, when we access the point $T_{new}$ (Figure 2), we know that all future
points will be diverse from $P_1$ and $P_2$. At this time, if $P_1$ and $P_2$ are dedicated followers
of L and mutually diverse, then we can replace L by $\{P_1, P_2\}$.

The integration of Buffered Greedy with distance browsing, as well as pruning
optimizations for minimizing the database processing, are discussed in [6]. Further, the
complexity of Buffered Greedy is shown to be $O(NK^2)$ in [6].

4 Experiments 

We conducted a detailed suite of experiments to evaluate the quality and efficiency of the 
MOTLEY algorithm with regard to producing a diverse set of answers. While a variety 
of datasets were used in our experiments (see [6] for details), we report on only one 
dataset here, namely Forest Cover [10], a real dataset containing 581,012 tuples and 4 
attributes representing Elevation, Aspect, Slope, and Distance. 

Our experiments involve uniformly distributed point queries across the whole data 
space, with the attribute domains normalised to the range [0, 1]. The default value of K, 
the desired number of answers, was 10, unless mentioned otherwise, and MinDiv was 
varied across [0, 1]. In practice, we expect that MinDiv settings would be on the low side, 
typically not more than 0.2, and we therefore focus on this range in our experiments. The 
decay rate (a) of the weights (Equation 2) was set to 0.1, Harmonic Mean was used for 
the Agg function (Equation 3), and spatial distances were computed using the Euclidean 
metric. The R-tree (specifically, the R* variant [1]) was created with a fill factor of 0.7 
and branching factor 64. 



Result-set Quality. We begin by characterizing the quality of the result set provided by
MOTLEY, which is a greedy online algorithm, against an off-line brute-force optimal
algorithm. This performance perspective is shown in Figure 3, which presents the average
and worst case ratio of the result set scores. As can be seen in the figure, the average
case is almost optimal (note that the Y-axis of the graph begins from 0.8), indicating
that MOTLEY typically produces a close-to-optimal solution. Moreover, even in the
worst case, the difference is only around 10 percent. Importantly, even in the situations
where MOTLEY did not provide the complete optimal result set, the errors were mostly
restricted to only one or two points, out of the total of ten answers.

Fig. 3. MOTLEY vs. Optimal

Fig. 4. MOTLEY vs. KNN

Figure 4 shows, as a function of MinDiv, the average distance of the points in
MOTLEY’s result set. This metric effectively captures the cost to be paid in terms
of distance in order to obtain result diversity. The important point to note here is that
for values of MinDiv up to 0.2, the distance increase is marginal with respect to the
traditional KNN query (MinDiv = 0). Since, as mentioned earlier, we expect that users
will typically use MinDiv values between 0 and 0.2, it means that diversity can be
obtained at relatively little cost in terms of distance.



Execution Efficiency. Having established the high quality of MOTLEY answers, we
now move on to evaluating its execution efficiency. In Figure 5, we show the average
fraction of tuples read to produce the result set as a function of MinDiv. Note firstly
that the tuples scanned are always less than 15% of the complete dataset. Secondly,
at lower values of MinDiv, the number of tuples read is small because we obtain K
diverse tuples after processing only a small number of points, whereas at higher values of
MinDiv, pruning is more effective and hence the number of tuples processed continues
to be small in comparison to the database size.



Effect of K. We also evaluated the effect of K, the number of answers, on the algorithmic
performance. Figure 6 shows the percentage of tuples read as a function of MinDiv for
different values of K ranging from 1 to 100. For K = 1, it is equivalent to the traditional
NN search, irrespective of MinDiv, due to requiring the closest point to form part of
the result set. As the value of K increases, the number of tuples read also increases,
especially for higher values of MinDiv. However, we can expect that users will specify
lower values of MinDiv for large K settings.

Fig. 5. Execution Efficiency of MOTLEY (x-axis: MinDiv)

Fig. 6. Effect of K (K = 1, 5, 10, 50, 100; MinDiv = 0.1 and 0.2)



5 Conclusions 

In this paper, we introduced the problem of finding the K Nearest Diverse Neighbors 
(KNDN), where the goal is to find the closest set of answers such that the user will find 
each answer sufficiently different from the rest, thereby adding value to the result set. We 
provided a quantitative notion of diversity that ensured that two tuples were diverse if 
they differed in at least one dimension by a sufficient distance, and presented a two-level 
scoring function to combine the orthogonal notions of distance and diversity. 

We described MOTLEY, an online algorithm for addressing the KNDN problem,
based on a buffered greedy approach integrated with a distance browsing technique.
Pruning optimizations were incorporated to improve the runtime efficiency. Our
experimental results demonstrated that MOTLEY can provide high-quality diverse solutions
at a low cost in terms of both result distance and processing time. In fact, MOTLEY’s
performance was close to the optimal in the average case and only off by around ten
percent in the worst case.

Abstract. K-means clustering is a popular data clustering algorithm.
Principal component analysis (PCA) is a widely used statistical technique
for dimension reduction. Here we prove that principal components
are the continuous solutions to the discrete cluster membership indicators
for K-means clustering, with a clear simplex cluster structure.
Our results prove that PCA-based dimension reductions are particularly
effective for K-means clustering. New lower bounds for the K-means
objective function are derived, which are the total variance minus the
eigenvalues of the data covariance matrix.



Introduction 

Data analysis methods are essential for analyzing the ever-growing massive quantity
of high dimensional data. On one end, cluster analysis attempts to pass
through data quickly to gain first order knowledge by partitioning data points
into disjoint groups such that data points belonging to the same cluster are similar
while data points belonging to different clusters are dissimilar. One of the most
popular and efficient clustering methods is the K-means method [2], which uses
prototypes to represent clusters by minimizing the squared error function.

On the other end, high dimensional data are often transformed into lower 
dimensional data via the principal component analysis (PCA) [3] (or singular 
value decomposition) where coherent patterns can be detected more clearly. Such 
unsupervised dimension reduction is used in broad areas such as meteorology, 
image processing, genomic analysis and information retrieval. 

The main basis of PCA-based dimension reduction is that PCA picks up
the directions with the largest variances. Mathematically, this is equivalent to
finding the best low rank approximation (in $L_2$ norm) of the data via the singular
value decomposition (SVD). However, this noise reduction property alone is
inadequate to explain the effectiveness of PCA.

In this paper, we prove that principal components are actually the continuous
solution of the cluster membership indicators in the K-means clustering
method, i.e., the PCA dimension reduction automatically performs data clustering
according to the K-means objective function. This provides an important
justification for PCA-based data reduction.

Our results also provide effective ways to solve the K-means clustering problem.
The K-means method uses K prototypes, the centroids of clusters, to characterize
the data. They are determined by minimizing the sum of squared errors,

$$J_K = \sum_{k=1}^{K} \sum_{i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)^2,$$

where $(\mathbf{x}_1, \ldots, \mathbf{x}_n) = X$ is the data matrix,
$\boldsymbol{\mu}_k = \sum_{i \in C_k} \mathbf{x}_i / n_k$ is the centroid of
cluster $C_k$ and $n_k$ is the number of points in $C_k$. The standard iterative solution to
K-means suffers from a well-known problem: as iteration proceeds, the solutions
are trapped in local minima due to the greedy nature of the update algorithm.
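
To make the objective concrete, the sketch below evaluates the sum of squared errors for two different partitions of the same four points. The function names and the toy data are illustrative assumptions.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_objective(clusters):
    """J_K: total squared distance of every point to the centroid of its cluster."""
    total = 0.0
    for members in clusters:
        mu = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in members)
    return total


# The same four points, partitioned in two different ways.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
```

The first partition gives an objective of 4 and the second gives 100. With centroids (5, 0) and (5, 2), the second partition is a fixed point of the usual assign-then-recompute iteration, since every point is already nearest to its own centroid, which is one instance of the local-minimum behaviour mentioned above.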

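Notations on PCA. X represents the original data matrix;
$Y = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$, $\mathbf{y}_i = \mathbf{x}_i - \bar{\mathbf{x}}$,
represents the centered data matrix, where $\bar{\mathbf{x}} = \sum_i \mathbf{x}_i / n$. The
covariance matrix (ignoring the factor 1/n) is
$\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T = YY^T$.

Principal directions $\mathbf{u}_k$ and principal components $\mathbf{v}_k$ are eigenvectors satisfying
$YY^T \mathbf{u}_k = \lambda_k \mathbf{u}_k$, $Y^T Y \mathbf{v}_k = \lambda_k \mathbf{v}_k$.
These are the defining equations for the SVD
of Y: $Y = \sum_k \lambda_k^{1/2} \mathbf{u}_k \mathbf{v}_k^T$. Elements of $\mathbf{v}_k$ are the
projected values of data points on the principal direction $\mathbf{u}_k$.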

The simplex structure of K-means clustering

The main result of the paper is to propose the simplex structure in the cluster
membership indicator space as a key concept in clustering and to prove that this
indicator space is an orthogonal transformation of the PCA subspace.

Indicator space and simplex structure. The solution of clustering can
be represented by K unsigned cluster membership indicator vectors:
$H_K = (\mathbf{h}_1, \ldots, \mathbf{h}_K)$, where

$$\mathbf{h}_k = (0, \ldots, 0, \overbrace{1, \ldots, 1}^{n_k}, 0, \ldots, 0)^T / n_k^{1/2} \qquad (1)$$

(Without loss of generality, we index the data such that data objects within
each cluster are adjacent.) Consider the simplex consisting of K basis vectors $\mathbf{h}_k$
plus the origin in the K-dimensional indicator space spanned by $H_K$. All objects
belonging to a cluster will collapse into the same corner of the simplex. Different
clusters collapse into different corners and they become well separated.

Thus solving the clustering problem is reduced to computing the simplex in 
the indicator space. 

Theorem 1 (Simplex Structure Theorem).

(i) The continuous solution of the transformed indicators, $Q_K = H_K T$, are
$V_K = (\mathbf{v}_1, \ldots, \mathbf{v}_{K-1}, n^{-1/2}\mathbf{e})$. In other words, the
K-dimensional indicator space is an
orthogonal transformation of the principal component subspace.

(ii) The $K \times K$ transformation matrix $T = (\mathbf{t}_1, \ldots, \mathbf{t}_K)$ are K eigenvectors of the
following eigenvalue equation

$$\Gamma \mathbf{t}_k = \lambda_k \mathbf{t}_k \qquad (2)$$

where 

r = f O = diag(7n7, • • •,7n7). 
and

^ij — ^ ii Az — ^ ' ^ij 7 (^) 

i 

where Cij > 0 (i ^ j), fiii = 0, ctij = nji, and Y.ij <^ij = 1- 
(iii) $J_K$ satisfies the upper and lower bounds

$$n\overline{y^2} - \sum_{k=1}^{K-1} \lambda_k < J_K < n\overline{y^2} \qquad (4)$$

where $n\overline{y^2}$ is the total variance and $\lambda_k$ are the principal eigenvalues of the
covariance matrix $YY^T$.
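
The bounds in part (iii) can be checked numerically in a small case. The sketch below is restricted to K = 2 and two-dimensional data so that the top eigenvalue of the 2 x 2 matrix $YY^T$ has a closed form; the helper names and the four sample points are assumptions made for illustration, and the check uses non-strict inequalities.

```python
import math


def centered(points):
    n = len(points)
    mean = [sum(c) / n for c in zip(*points)]
    return [tuple(x - m for x, m in zip(p, mean)) for p in points]


def sse(cluster):
    """Sum of squared distances of a cluster's points to its own centroid."""
    return sum(sum(v * v for v in y) for y in centered(cluster))


def bounds_k2(points):
    """Lower and upper bound on J_2 for 2-D data: (total - lambda_1, total)."""
    ys = centered(points)
    sxx = sum(y[0] * y[0] for y in ys)
    syy = sum(y[1] * y[1] for y in ys)
    sxy = sum(y[0] * y[1] for y in ys)
    total = sxx + syy                                     # trace of Y Y^T
    lam1 = total / 2 + math.hypot((sxx - syy) / 2, sxy)   # largest eigenvalue
    return total - lam1, total


clusters = [[(0.0, 0.0), (1.0, 2.0)], [(10.0, 1.0), (11.0, 3.0)]]
points = [p for c in clusters for p in c]
j2 = sum(sse(c) for c in clusters)
lower, upper = bounds_k2(points)
```

For these points the total variance term is 106, the top eigenvalue is about 102.48, and the two-cluster objective of 5 lies between the lower bound of about 3.52 and the upper bound of 106.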

The proof of Theorem 1 is outlined in the appendix. The spectral relaxation of
K-means was first studied in [4]. We now establish the effectiveness of PCA dimension
reduction for K-means clustering with the following propositions.

Proposition 2. K-means clustering has two invariant properties: (i) it is
invariant under any orthogonal coordinate rotation operation

$$A: \mathbf{x} \to T\mathbf{x},$$

(ii) it is invariant under the coordinate translation (shift) operation

$$L: \mathbf{x} \to \mathbf{x} + \mathbf{t}.$$



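Proof. The K-means objective function can be written as

$$J_K = \sum_{k=1}^{K} \sum_{i,j \in C_k} \frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2 n_k} \qquad (5)$$

Now (i) follows from the fact that for orthonormal transformations T, $T^T T = I$;
thus $\|T\mathbf{x}_i - T\mathbf{x}_j\| = \|\mathbf{x}_i - \mathbf{x}_j\|$. (ii) is a simple consequence of
$\|(\mathbf{x}_i + \mathbf{t}) - (\mathbf{x}_j + \mathbf{t})\| = \|\mathbf{x}_i - \mathbf{x}_j\|$. □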
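
Both invariances can be observed numerically from the pairwise-distance form of the objective, $\sum_k \sum_{i,j \in C_k} \|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2 n_k)$. The following sketch assumes two-dimensional data, a plane rotation as the orthogonal transformation, and arbitrary illustrative values for the angle and the shift.

```python
import math


def pairwise_objective(clusters):
    """Sum over clusters of all ordered-pair squared distances / (2 * n_k)."""
    total = 0.0
    for members in clusters:
        pair_sum = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                       for p in members for q in members)
        total += pair_sum / (2 * len(members))
    return total


def rotate_and_shift(p, theta, shift):
    """Apply a 2-D rotation by theta followed by a translation by shift."""
    x, y = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])


clusters = [[(0.0, 0.0), (1.0, 2.0)], [(10.0, 1.0), (11.0, 3.0)]]
moved = [[rotate_and_shift(p, 0.7, (3.0, -4.0)) for p in c] for c in clusters]
before = pairwise_objective(clusters)
after = pairwise_objective(moved)
```

The objective is 5 both before and after the transformation (up to floating-point rounding), since only differences between points of the same cluster enter the sum.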

Proposition 3. PCA dimension reduction is effective for K-means clustering.
Proof. From Theorem 1, the eigen-space $V_K$ are the relaxed solutions of the
transformed indicators $Q_K$, i.e., K-means clustering in eigen-space $V_K$ is
approximately equivalent to that in the transformed indicator space $Q_K$. Because
K-means clustering is invariant w.r.t. the orthogonal transformation T
(Proposition 2), K-means clustering in $Q_K$ space is equivalent to K-means clustering in
the indicator space $H_K$. In $H_K$ space, all objects belonging to a cluster
(approximately) collapse into the same corner of the simplex. Different clusters become
well separated. Hence clustering in $H_K$ space is particularly effective; our
results provide a theoretical basis for the use of PCA dimension reduction for
K-means clustering.

Table 1. Clustering accuracy as the PCA dimension is reduced from original 1000.

Dim    A5 (balanced)   A5 (un-bal.)   B5 (balanced)   B5 (un-bal.)
5      0.81/0.91       0.88/0.86      0.59/0.70       0.64/0.62
6      0.91/0.90       0.87/0.86      0.67/0.72       0.64/0.62
10     0.90/0.90       0.89/0.88      0.74/0.75       0.67/0.71
20     0.89            0.90           0.74            0.72
40     0.86            0.91           0.63            0.68
1000   0.75            0.77           0.56            0.57



Experiments on Internet Newsgroups 



The 20-newsgroup dataset is from www.cs.cmu.edu/afs/cs/project/theo-11/www/
naive-bayes.html. A word-document matrix is first constructed. 1000 words are
selected according to the mutual information. Standard tf.idf term weighting
is used. Each document is normalized to 1. We focus on two sets of 5-newsgroup
combinations:





A5:                                B5:
NG2:  comp.graphics                NG2:  comp.graphics
NG9:  rec.motorcycles              NG3:  comp.os.ms-windows
NG10: rec.sport.baseball           NG8:  rec.autos
NG15: sci.space                    NG13: sci.electronics
NG18: talk.politics.mideast        NG19: talk.politics.misc



We apply K-means clustering in the PCA subspace. Here we reduce the data
from the original 1000 dimensions to 40, 20, 10, 6 and 5 dimensions respectively. The
clustering accuracies on 10 random samples of each newsgroup combination and
size composition are averaged and the results are listed in Table 1. To show the
subtle difference between centering the data or not, at 10, 6 and 5 dimensions the results
for the original uncentered data are listed at left and the results for the centered data are
listed at right.

Two observations. (1) It is clear that as dimensions are reduced, the results
systematically and significantly improve. For example, for dataset A5-balanced,
the clustering accuracy improves from 75% at 1000-dim to 91% at 5-dim.
(2) For a very small number of dimensions, PCA based on the centered data seems
to lead to better results.



Appendix 

Here we outline the proof of Theorem 1. First, $J_K$ can be written as

$$J_K = \sum_i \mathbf{x}_i^2 - \sum_k \frac{1}{n_k} \sum_{i,j \in C_k} \mathbf{x}_i^T \mathbf{x}_j \qquad (6)$$

The first term is a constant. The second term is the sum of the K diagonal block
elements of the $X^T X$ matrix representing within-cluster (inner-product) similarities.
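Using Eq.(1), Eq.(6) becomes

$$J_K = \mathrm{Tr}(X^T X) - (\mathbf{h}_1^T X^T X \mathbf{h}_1 + \cdots + \mathbf{h}_K^T X^T X \mathbf{h}_K). \qquad (7)$$

There are redundancies in $H_K$. For example, $\sum_{k=1}^{K} n_k^{1/2} \mathbf{h}_k = \mathbf{e}$. Thus one of the
$\mathbf{h}_k$'s is a linear combination of the others. We remove this redundancy by (a) performing
a linear transformation T into $\mathbf{q}_k$'s:
$Q_K = (\mathbf{q}_1, \ldots, \mathbf{q}_K) = H_K T$, or $\mathbf{q}_\ell = \sum_k \mathbf{h}_k t_{k\ell}$,
where $T = (t_{ij})$ is a $K \times K$ orthonormal matrix: $T^T T = I$, and (b)
requiring that the last column of T is

$$\mathbf{t}_K = (\sqrt{n_1/n}, \ldots, \sqrt{n_K/n})^T, \qquad (8)$$

leading to $\mathbf{q}_K = \sqrt{n_1/n}\,\mathbf{h}_1 + \cdots + \sqrt{n_K/n}\,\mathbf{h}_K = \sqrt{1/n}\,\mathbf{e}$.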

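The mutual orthogonality of $\mathbf{h}_k$, $\mathbf{h}_k^T \mathbf{h}_\ell = \delta_{k\ell}$, implies the mutual orthogonality
of $\mathbf{q}_k$, $\mathbf{q}_k^T \mathbf{q}_\ell = \delta_{k\ell}$. Let
$Q_{K-1} = (\mathbf{q}_1, \ldots, \mathbf{q}_{K-1})$; the orthogonality relations become

$$Q_{K-1}^T Q_{K-1} = I_{K-1}, \qquad (9)$$

$$\mathbf{q}_k^T \mathbf{e} = 0, \quad \text{for } k = 1, \ldots, K-1, \qquad (10)$$

which must be maintained. The K-means objective can now be written as

$$J_K = \mathrm{Tr}(X^T X) - \mathbf{e}^T X^T X \mathbf{e}/n - \mathrm{Tr}(Q_{K-1}^T X^T X Q_{K-1}). \qquad (11)$$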

$J_K$ does not distinguish the original data $\{\mathbf{x}_i\}$ and the centered data $\{\mathbf{y}_i\}$.
Repeating the above derivation on $\{\mathbf{y}_i\}$, we have

$$J_K = \mathrm{Tr}(Y^T Y) - \mathrm{Tr}(Q_{K-1}^T Y^T Y Q_{K-1}), \qquad (12)$$

noting $Y\mathbf{e} = 0$. The first term is constant. The problem reduces to optimization
of the second term subject to the constraints Eqs.(9,10).

Now we relax the restriction that $\mathbf{q}_k$ must take discrete values, and let
$\mathbf{q}_k$ take continuous values, while keeping constraint Eq.(9). This maximization
problem has a well-known solution via the Ky Fan theorem [1]. The solutions are
$Q_{K-1} = (\mathbf{v}_1, \ldots, \mathbf{v}_{K-1})$, the principal components, which also automatically satisfy the
constraint Eq.(10). From this, (i) and (iii) follow.

To prove (ii), we can show that (a) $\mathbf{t}_K$ of Eq.(8) is an eigenvector of $\Gamma$ with
eigenvalue $\lambda = 0$. (b) The symmetric $\Gamma$ is also semi-positive definite. The other $K - 1$
eigenvectors are mutually orthonormal, $\mathbf{t}_k^T \mathbf{t}_\ell = \delta_{k\ell}$. These prove (ii). □

Abstract. Sequential pattern mining is a well-studied problem. In the context
of mobile computing, moving sequential patterns that reflect the moving behavior
of mobile users have recently attracted researchers’ interest. In this paper a
novel and efficient technique is proposed to mine moving sequential patterns.
First, the idea of clustering is introduced to process the original moving histories
into moving sequences as a preprocessing step. Then an efficient algorithm
called PrefixTree is presented to mine the moving sequences. A performance
study shows that PrefixTree outperforms the LM algorithm, which is revised to
mine moving sequences, in mining large moving sequence databases.



1 Introduction 

Mining moving sequential patterns has great significance for effective and efficient
location management in wireless communication systems. The problem of mining
moving sequential patterns is a special case of mining traditional sequential patterns
with the extension of support. There are four main differences between mining
conventional sequential patterns and moving sequential patterns. First, if two items
are consecutive in a moving sequence α, and α is a subsequence of β, those two items
must be consecutive in β. This is because we care about what the next move is for a
mobile user in mining moving sequential patterns. Second, in mining moving sequential
patterns the support considers the number of occurrences in a moving sequence,
which supports more reasonable pattern discovery, so the support of a moving
sequence is the sum of its numbers of occurrences in all the moving sequences of the
whole moving sequence database. Third, the Apriori property plays an important
role for efficient candidate pruning in mining traditional sequential patterns. For example,
suppose <ABC> is a frequent length-3 sequence; then all the length-2
subsequences {<AB>, <AC>, <BC>} must be frequent in mining sequential patterns.
In mining moving sequential patterns <AC> may not be frequent. This is because a
mobile user can only move into a neighboring cell in a wireless system and items
must be consecutive in mining moving sequential patterns. In addition, <AC> is no longer a
subsequence of <ABC> in mining moving sequential patterns; in that sense, the property
that any subsequence of a frequent moving sequence must be frequent still holds,
which is called the Pseudo-Apriori property. The last difference is that a
moving sequence is an ordered list of items, but not an ordered list of itemsets, where each
item is a cell id.
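
The consecutive-item requirement and the occurrence-based support can be illustrated with a short sketch. It assumes that every (possibly overlapping) contiguous occurrence of a pattern inside a moving sequence counts once toward the support; the function names and the toy database of cell ids are hypothetical.

```python
def occurrences(pattern, sequence):
    """Count positions where pattern appears as a run of consecutive items."""
    n, m = len(sequence), len(pattern)
    return sum(1 for i in range(n - m + 1) if sequence[i:i + m] == pattern)


def support(pattern, database):
    """Sum of the occurrence counts over all moving sequences in the database."""
    return sum(occurrences(pattern, seq) for seq in database)


# Hypothetical moving sequences; each item is a cell id.
db = [list("ABCAB"), list("ABC"), list("CAB")]
ab = support(list("AB"), db)
ac = support(list("AC"), db)
abc = support(list("ABC"), db)
bc = support(list("BC"), db)
```

Here <AB> has support 4 because it occurs twice in the first sequence, <AC> has support 0 even though A and C co-occur in every sequence, and the contiguous subsequences <AB> and <BC> of <ABC> have support at least that of <ABC>, in line with the Pseudo-Apriori property.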

Wen-Chih Peng et al. presented a data-mining algorithm for mining
user moving patterns in a mobile computing environment in [1]. Moving pattern
mining is based on a roundtrip model [2], and their LM algorithm selects an initial
location S, which is either the VLR or HLR whose geographic area contains the homes of
the mobile users. If a mobile user goes to a strange place for one month or
longer, the method in [1] cannot find the proper moving pattern to characterize the
mobile user. A more general method should not make any assumption about the start point
of a moving pattern. Basically, algorithm LM is a variant of GSP [3].
Apriori-based methods can efficiently prune candidate sequence patterns based on the
Apriori property, but in moving sequential pattern mining we cannot prune candidate
sequences efficiently because the moving sequential pattern only preserves the
Pseudo-Apriori property. Meanwhile, Apriori-based algorithms still encounter problems
when a sequence database is large and/or when sequential patterns to be mined are
numerous and/or long [4].

The time factor is considered for personal paging area design in [5]. G. Das et al.
present a clustering-based method to discretize a time series in [6]. Time is also a
very important factor in mining moving sequential patterns. In this paper, the idea of
clustering is first introduced into the mining of moving sequential patterns to
discretize the time attribute of the moving histories, and the moving histories are
transformed into moving sequences based on the clustering result. Then, based on the
idea of projection and the Pseudo-Apriori property, an efficient moving sequential
pattern mining algorithm called PrefixTree is proposed, which effectively represents
candidate frequent moving sequences with a key tree structure, the prefix tree. In
addition, an optimization approach based on the wireless network topology is
presented to improve the efficiency of the PrefixTree algorithm.

The rest of the paper is organized as follows. Data preprocessing and the Prefix- 
Tree algorithm are given in section 2. Section 3 gives the experimental results from 
different viewpoints. Discussion is made in section 4. 



2 Mining Moving Sequential Patterns 

User moving history is the moving log of a mobile user, which is an ordered list of
(c, t) pairs, where c is the cell ID and t is the time when the mobile user reaches cell c.
Let $MH = \langle (c_1, t_1), (c_2, t_2), (c_3, t_3), \ldots, (c_n, t_n) \rangle$ be a moving history. MH means that
the mobile user enters $c_1$ at $t_1$, leaves $c_1$ and enters $c_2$ at $t_2$, leaves $c_2$ and enters
$c_3$ at $t_3$, and so on, and finally enters $c_n$ at $t_n$. Each $(c_i, t_i) \in MH$ ($1 \le i \le n$) is
called an element of MH. The time difference between two consecutive elements in a
moving history reflects the mobile user's moving speed (or sojourn time in a cell). If
the difference is large, the mobile user moves at a relatively low speed; if the
difference is small, the mobile user moves at a relatively high speed. The idea behind
using clustering to discretize time is that a mobile user with regular moving behavior
often moves along the same set of paths, and the arrival times at each point of those
paths are similar.

The clustering algorithm CURD [7] is used to discretize the time attribute of the
moving histories. For each cell c in the cell set, all the elements for c in the moving
history database D are collected, denoted by
$ES(c) = \{(c, t) \mid \exists\, (c, t) \in MH_j \text{ and } MH_j \in D\}$. Then the
CURD algorithm is used to cluster the element set ES(c), where the Euclidean distance
on time t is used to measure the similarity between two elements in ES(c). This is a
clustering problem in one-dimensional space. After clustering we have a result of the
form $\{(c, T_i, T_j) \mid T_i, T_j \in T \text{ and } T_i < T_j\}$ (T is the time domain), which
means that the mobile user often enters cell c during the period $[T_i, T_j]$.

The main idea of data transformation is to replace the MH elements with the corre- 
sponding clusters that they belong to. The transformed moving history is called a 
moving sequence, and the transformed moving history database is called moving 
sequence database. Each moving history can be transformed into one and only one 
moving sequence because the clustering method guarantees that any MH element 
belongs to one and only one cluster. 
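
The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: CURD itself is not reproduced, and a simple gap-based one-dimensional clustering (the hypothetical `cluster_times` with a `max_gap` parameter) stands in for it. All function names and the toy data are invented for the example.

```python
from collections import defaultdict

def cluster_times(times, max_gap):
    """Split sorted arrival times into intervals wherever the gap exceeds max_gap.

    Stand-in for CURD: a simple one-dimensional gap-based clustering.
    """
    ordered = sorted(times)
    intervals, start, prev = [], ordered[0], ordered[0]
    for t in ordered[1:]:
        if t - prev > max_gap:
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals

def build_clusters(histories, max_gap):
    """Collect ES(c) for every cell c and cluster its arrival times."""
    element_sets = defaultdict(list)
    for mh in histories:
        for cell, t in mh:
            element_sets[cell].append(t)
    return {cell: cluster_times(ts, max_gap) for cell, ts in element_sets.items()}

def to_moving_sequence(mh, clusters):
    """Replace each (c, t) element with the time-interval cluster of c that contains t."""
    seq = []
    for cell, t in mh:
        lo, hi = next((a, b) for a, b in clusters[cell] if a <= t <= b)
        seq.append((cell, lo, hi))
    return seq

# Toy moving histories: (cell id, arrival time in hours).
histories = [
    [("c1", 8.0), ("c2", 8.5), ("c3", 9.0)],
    [("c1", 8.1), ("c2", 8.6), ("c3", 9.2)],
    [("c1", 18.0), ("c2", 18.4)],
]
clusters = build_clusters(histories, max_gap=1.0)
sequences = [to_moving_sequence(mh, clusters) for mh in histories]
```

In this toy data the two morning histories map to the same moving sequence, while the evening history maps to different items even though it visits the same cells.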

The PrefixTree algorithm needs to scan the database only three times, and its key
idea is the use of prefix trees. A prefix tree is a compact representation of
candidate moving sequential patterns. The root is a frequent item and is defined to be
at depth one. Each node contains three attributes: the item, a count that records the
support of the item, and a flag indicating whether the node has been traversed. The
items of a node's children are all contained in its candidate consecutive items
(CCIs). In the first two scans PrefixTree generates the frequent items, the frequent
length-2 moving sequential patterns and the CCIs of each frequent item, and the
prefix trees are constructed in the third scan. For each item $C_i$, we call the items
that may appear just after it its candidate consecutive items, denoted by $CCI(C_i)$.
Only items that follow $C_i$ in a moving sequence can be consecutive items of $C_i$, and
for any item $C_j \in CCI(C_i)$, the length-2 moving sequence $\langle C_i C_j \rangle$ is frequent.
It is easy to generate the moving sequential patterns from the prefix trees. Every
path from a root node to a leaf node is a candidate frequent moving sequence, and
scanning all the prefix trees once generates all the moving sequential patterns. The
support of the nodes decreases as the depth increases, so when we traverse a prefix
tree from the root towards the leaves and encounter a node whose count is less than
the support threshold, the path traversed before that node yields a new frequent
moving sequence.
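
A minimal sketch of the idea follows. It is not the authors' implementation: it does not reproduce the three-scan schedule, the traversal flag, or the projected sequences, and it stores root-to-node paths in a counter instead of explicit tree nodes. The function names and the toy database are assumptions made for illustration, and items are plain cell ids.

```python
from collections import Counter, defaultdict

def frequent_items_and_cci(sequences, min_sup):
    """Return frequent items and, for each item, its candidate consecutive items."""
    item_count, pair_count = Counter(), Counter()
    for seq in sequences:
        item_count.update(set(seq))                 # support counts each sequence once
        pair_count.update(set(zip(seq, seq[1:])))   # consecutive pairs only
    freq_items = {i for i, n in item_count.items() if n >= min_sup}
    cci = defaultdict(set)
    for (a, b), n in pair_count.items():
        if n >= min_sup:                            # <a b> is a frequent length-2 pattern
            cci[a].add(b)
    return freq_items, dict(cci)

def mine_moving_patterns(sequences, min_sup):
    """Count root-to-node paths of the (implicit) prefix trees and keep frequent ones."""
    freq_items, cci = frequent_items_and_cci(sequences, min_sup)
    path_count = Counter()
    for seq in sequences:
        paths = set()
        for start, item in enumerate(seq):
            if item not in freq_items:
                continue
            path = [item]
            paths.add(tuple(path))
            for nxt in seq[start + 1:]:
                if nxt not in cci.get(path[-1], ()):
                    break                           # child must be a CCI of its parent
                path.append(nxt)
                paths.add(tuple(path))
        path_count.update(paths)
    return {p: n for p, n in path_count.items() if n >= min_sup}

db = [list("ABC"), list("ABC"), list("ABD"), list("ACB")]
patterns = mine_moving_patterns(db, min_sup=2)
```

With a support threshold of 2, the toy database yields <A>, <B>, <C>, <AB>, <BC> and <ABC> as frequent moving sequences, and <AC> is absent even though <ABC> is frequent.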

The extension of support is considered when generating the prefix trees, which can
be done by generating projected sequences. Another novelty of the PrefixTree
algorithm is that it does not generate physical projected files; generating such files,
as traditional projection-based sequential pattern mining algorithms [4] do, is
time-consuming. Due to space limitations, we cannot give further details here.

As pointed out in [8], the initial candidate set generation, especially for the length-2
frequent patterns, is the key issue in improving the performance of data mining.
Fortunately, prior knowledge of the wireless network topology is sometimes
available. From this knowledge the neighboring cells of each cell are known in
advance. It has been pointed out that for any two consecutive items in a moving
sequence, one item must be a neighboring cell of the other or the same cell. The
number of neighboring cells is usually small: for example, it is two in the
one-dimensional model, six in the two-dimensional hexagonal model and eight in the
two-dimensional mesh model, and it is relatively small even in the graph model [9].
Based on this prior knowledge, the number of candidate length-2 moving sequential
patterns is greatly reduced, and the frequent length-2 moving sequential patterns can
be generated efficiently.
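
The effect of the topology on the number of length-2 candidates can be illustrated with a small sketch. The function name and the neighbor map are hypothetical, items are treated as plain cell ids (ignoring the time clusters), and the one-dimensional layout with 90 cells is chosen only to match the number of zones in the experiments.

```python
def candidate_pairs(items, neighbors=None):
    """Candidate length-2 moving sequences, optionally restricted by topology.

    neighbors maps each cell to the set of its neighboring cells (assumed known).
    """
    if neighbors is None:
        return {(a, b) for a in items for b in items}
    return {(a, b) for a in items for b in (neighbors[a] | {a}) if b in items}

# One-dimensional model with 90 cells: cell i neighbors i-1 and i+1.
cells = set(range(90))
line = {i: {j for j in (i - 1, i + 1) if j in cells} for i in cells}
print(len(candidate_pairs(cells)))        # 8100
print(len(candidate_pairs(cells, line)))  # 268
```

Without topology every ordered pair of cells is a candidate; with it, each cell contributes only itself and its neighbors as possible successors.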



3 Experimental Results and Performance Study 

All experiments are performed on a 1.7GHz Pentium 4 PC with 512MB of main
memory and a 60GB hard disk, running Microsoft Windows 2000 Professional. All the
methods are implemented using JBuilder 6.0. The synthetic dataset used in our
experiments comes from SUMATRA (Stanford University Mobile Activity TRAces,
available at http://www-db.stanford.edu/sumatra/). The BALI-2 (Bay Area Location
Information, real-time) dataset records the mobile users' moving and calling
activities in a day. On average, a mobile user moves 7.2 times a day among 90 zones,
so the number of items is 90 and the average length of the moving sequences is 8.2.
BALI-2 contains about 40,000 moving sequences, which are used in our experiments.





Fig. 1. Performance study with varying support threshold, where the left part and the right part
are without and with optimization respectively





Fig. 2. Performance study with varying number of moving sequences, where the left part and 
the right part are without and with optimization respectively 

Fig. 1 and Fig. 2 show that the PrefixTree algorithm, which needs only three scans
of the moving sequence database, is more efficient and scalable than the Revised LM
algorithm, which needs as many scans of the moving sequence database as the length
of the longest moving sequential pattern. In addition, the optimization based on the
wireless network topology greatly improves the efficiency of the mining process. Let
|D| be the number of moving sequences in database D, and let L be the length of the
longest moving sequential pattern. Assume that reading one moving sequence from a
data file costs 1 unit of I/O. The I/O cost of the Revised LM algorithm is then L|D|,
and the I/O cost of PrefixTree is 3|D|. From this I/O cost analysis we can draw the
coarse conclusion that the PrefixTree algorithm is more efficient than the Revised LM
algorithm if L is greater than 3. This simple I/O analysis is consistent with the
efficiency of the PrefixTree algorithm shown in the experimental results.
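
The I/O comparison can be tabulated with a few lines of code. The function names are illustrative, and the values of L are hypothetical; only the two cost formulas and the approximate database size come from the text.

```python
def io_cost_revised_lm(num_sequences, longest_pattern_len):
    """One full database scan per pattern length: L * |D| units."""
    return longest_pattern_len * num_sequences

def io_cost_prefixtree(num_sequences):
    """Three database scans regardless of pattern length: 3 * |D| units."""
    return 3 * num_sequences

D = 40_000  # approximate size of the BALI-2 moving sequence database
for L in (2, 3, 5, 8):  # hypothetical longest-pattern lengths
    print(L, io_cost_revised_lm(D, L), io_cost_prefixtree(D))
```

The two costs coincide at L = 3, and the gap grows linearly in L beyond that point.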



4 Discussion 

In this paper the idea of clustering is introduced to discretize the time attribute
in moving histories. A novel and efficient method, called PrefixTree, is then
proposed to mine the moving sequences. Its main idea is to generate projected
sequences and construct the prefix trees based on candidate consecutive items.
Another valuable feature of the PrefixTree algorithm is its support for parameter
tuning: the prefix trees for a higher support threshold can be generated directly
from the prefix trees built with a lower one. This is highly desirable because in
most cases the user tries several minimum supports before being satisfied with
the result.

Abstract. In the retail industry, it is very important to understand seasonal sales
patterns, because this knowledge can assist decision makers in managing
inventory and formulating marketing strategies. The Self-Organizing Map (SOM) is
suitable for extracting and illustrating essential structures because it has
unsupervised learning and topology-preserving properties, and prominent
visualization techniques. In this experiment, we propose a method for seasonal
pattern analysis using the Self-Organizing Map. A performance test with real-world
data from stationery stores in Indonesia shows that the method is effective for
seasonal pattern analysis. The results are used to formulate several marketing
and inventory management strategies.

Keywords: Visualization, Clustering, Temporal Data, Self-Organizing Maps. 



1 Introduction 

In the retail industry, it is very important to understand the seasonal sales patterns
of groups of similar items, such as different brands of pencils. This knowledge can be
used by decision makers in processes such as managing inventory and formulating
marketing strategy. However, it is difficult to understand seasonal sales patterns
from a large sales transaction database. Common ways to extract and illustrate
essential structures from data are visualization and clustering. Visualization is
important since humans can absorb information and extract knowledge efficiently
from images. With clustering, groups of similar items (such as pencils, pens) that
have similar seasonal sales patterns are put together in the same cluster.

The self-organizing map (SOM), a special type of artificial neural network that
performs unsupervised competitive learning [9], is suitable for extracting and
illustrating essential structures from data, because it requires no a priori assumptions
about the distribution of the data [4], allows visualization and exploration of the
high-dimensional data space by projecting onto simple geometric relationships, and
follows the distribution of the original data set [9]. Because the prototype vectors in
a SOM are topologically ordered, i.e. similar objects are mapped nearby, SOM is also
suitable for clustering.

The aim of this paper is to extract and illustrate essential seasonal pattern structures
from groups of similar sales items using Kohonen's Self-Organizing Maps. This kind
of approach has not been fully explored in real-world applications in the retail
industry. The method is applied to real-world data from a retailer in Indonesia that
has not been used in other research.

The structure of the paper is as follows. Section 2 briefly describes the data and 
proposed method. The experiences in applying the method to a retail industry are 
discussed in Section 3 and the paper closes with a concluding section. 



2 Proposed SOM Method and Data 

2.1 Description of the Data 

In this experiment, we use real-life data from a major stationery retail chain based in
Indonesia. The data consist of two years of sales data, from 1999 to 2000, from seven
branches, with about 1.5 million transaction records for 17,836 products. The
products are divided into 34 groups of similar items (such as school equipment), and
further divided into 551 sub-groups of more similar items (such as pencil cases).

2.2 Proposed SOM Method for Seasonal Pattern Analysis 

Seasonal pattern analysis process using SOM can be generally divided into data 
preparation stages, map training, mapping the data set to the map, and lastly analysis 
based on visualization and clustering of the map. 

Data Pre-processing. In comparing time series, some difficulties might arise due to 
noise, offset translation, amplitude scaling, longitudinal scaling, linear drift, and 
discontinuities [8]. However, some of these problems can be considered not important 
in comparing seasonality pattern. Longitudinal scaling problems are not important, 
since data should be compared for the same time-period. Moreover, time-series with 
linear drift can be assumed as different time-series. 

The discontinuity, amplitude scaling and offset translation problems can be solved by
pre-processing and normalizing the data set. The discontinuity problem can be
alleviated by aggregating daily sales quantities into monthly totals or by using a
moving average. The amplitude scaling and offset translation problems can be solved
by normalizing the data set. In our experiment, the monthly total sales quantities are
divided by the average monthly sales in a year. This measurement is called the
seasonality index. A seasonality index value of 5 for a subgroup in February means
that the sales quantity in February is five times the average monthly sales of that
subgroup. In this experiment, a seasonal pattern is defined as the series of
seasonality indexes for the 12 months.
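
The normalization can be written down directly. The sketch below assumes a single year of monthly totals for one subgroup; how the two years of data are combined is not stated in the text, and the function name and sample figures are hypothetical.

```python
def seasonality_index(monthly_sales):
    """Divide each month's total sales by the average monthly sales of the year."""
    if len(monthly_sales) != 12:
        raise ValueError("expected 12 monthly totals")
    average = sum(monthly_sales) / 12
    if average == 0:
        raise ValueError("subgroup has no sales in this year")
    return [q / average for q in monthly_sales]

# Hypothetical subgroup: 100 units each month except 1600 units in February.
sales = [100, 1600] + [100] * 10
index = seasonality_index(sales)
print([round(v, 2) for v in index])
```

By construction the twelve indexes of a subgroup sum to 12, so a perfectly uniform subgroup has an index of 1 in every month.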

To compare the similarity of time-series data, several researchers have developed
measures based on the longest common subsequence [1,2]. However, this kind of
similarity measure is not appropriate for seasonal patterns for several reasons.
Firstly, the measure allows gaps in the sequences that match, which allows
seasonality indexes from different time periods to be compared; this is illogical.
Secondly, the measure does not consider the degree of difference. For instance, it
produces the same results for comparing {3 1 1 1}, {1 1 1 3}, and {1 1 1 10}.
Therefore, the similarity of time-series data in this experiment is measured with the
Euclidean distance, which is also fast to compute and thus suitable for clustering
tasks.
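
The second objection can be checked numerically. The sketch below reads the three example series as {3 1 1 1}, {1 1 1 3} and {1 1 1 10}, and uses an exact-match longest common subsequence as a simplified stand-in for the measures in [1,2], which may allow tolerances; both are assumptions.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence (exact matches, gaps allowed)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def euclidean(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q, r = [3, 1, 1, 1], [1, 1, 1, 3], [1, 1, 1, 10]
print(lcs_length(p, q), lcs_length(p, r))   # 3 3
print(euclidean(p, q), euclidean(p, r))
```

The subsequence-based score cannot tell the second and third series apart when compared with the first, whereas the Euclidean distance to the third series is clearly larger.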

Map Training. In this experiment, the map size is 16x20 with hexagonal grid 
topology. The map is initialized based on two principal eigenvectors of the input data 
vectors, and trained using the batch version of SOM. After the map is trained, each 
data vector is mapped to the most similar prototype vector in the map (the best 
matching unit). The software used is SOM Toolbox 2.0 [12]. 
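
The mapping step after training can be sketched independently of SOM Toolbox. The code below is not the toolbox's API; it is a plain illustration of finding the best matching unit by Euclidean distance, with hypothetical function names and a toy map of three prototypes in two dimensions (the experiment uses a 16x20 map and 12-dimensional vectors).

```python
def best_matching_unit(vector, prototypes):
    """Index of the prototype vector closest to `vector` in squared Euclidean distance."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(vector, p))
    return min(range(len(prototypes)), key=lambda k: sq_dist(prototypes[k]))

def map_to_units(data, prototypes):
    """Group data indices by their best matching unit."""
    hits = {}
    for i, vec in enumerate(data):
        hits.setdefault(best_matching_unit(vec, prototypes), []).append(i)
    return hits

prototypes = [[1.0, 1.0], [0.2, 3.0], [3.0, 0.2]]
data = [[0.9, 1.1], [0.1, 2.8], [1.1, 1.0]]
print(map_to_units(data, prototypes))
```

Each node thus ends up with the set of subgroups whose seasonal patterns are closest to its prototype vector.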

Analysis of SOM and Its Clustering Results. Generally, it is not easy to capture
knowledge from a SOM solely by observing its prototype vectors, especially when
the map is large. Visualization of the SOM can be used to capture patterns in the data
set. Some useful visualization tools for seasonal pattern analysis are the unified
distance matrix (u-matrix) and component planes. The u-matrix shows the distances
between neighbouring nodes in the data space using a grey-scale representation on
the map grid [7]. Long distances, which indicate highly dissimilar features between
neighbouring nodes, are represented with darker colours and divide clusters (dense
parts of the map with similar features). Component planes show the values of a
certain component of the prototype vectors in the SOM [10].

Since visualization of the u-matrix shows only the rough cluster structure of the data
set, the prototype vectors from the map are clustered using a partitive clustering
technique (k-means) [11] to get a finer clustering result. After that, simple
descriptive statistics, such as averages, of the cluster members' attributes (such as
sales quantity) are calculated for each cluster.



3 Results and Discussions 

Due to business data confidentiality, the complete results cannot be shown in this
paper. However, permission to use a small number of examples of the results was
granted by the owner of the data.

After the data is pre-processed, there are 572 input vectors for SOM training, each
with 12 dimensions. Once the map is trained, each node of the SOM stores a prototype
vector of 12 components, each of which represents the seasonality index for one
month from January to December, and a set of subgroups that are mapped to the node.
The values of each prototype vector are visualized using bar plots in Fig. 1.

Since SOM follows the distribution of the original data set, the structure of the
original data set can be extracted from the map. In Fig. 1, there are more nodes with
roughly stable seasonality indexes throughout the year (the nodes in the centre of the
map) than nodes with non-uniform seasonality indexes (the nodes at the edge of the
map). It can therefore be inferred that most of the subgroups in the original data set
also have roughly the same seasonal pattern. This knowledge is difficult to acquire by
observing the



An Alternative Methodology for Mining Seasonal Pattern 






Fig. 1. Bar plots of the component values of each prototype vector



original data set or calculating mean and standard deviation of each variable. For that 
reason, visualization of SOM makes it easier for analysts to acquire knowledge about 
the structure of the original data set. 

Table 1. Clusters' characteristics

    Total   Total Sales Qty in 2 Years   Average of Seasonal Index
No  Member  Average        Sum      %    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
 1      79   11,862    937,126  12.2%    0.1  0.1  0.1  0.2  0.7  1.2  1.6  1.5  1.9  1.7  1.6  1.2
 2      20   22,953    459,055   6.0%    0.2  0.3  0.3  0.4  0.5  1.1  4.8  1.7  0.9  0.7  0.6  0.4
 3     127   21,668  2,751,805  35.9%    0.6  0.7  0.9  1.0  0.9  1.0  1.2  1.3  1.2  1.1  1.0  0.9
 4      56    3,318    185,820   2.4%    0.0  0.0  0.0  0.0  0.1  0.1  0.2  0.2  2.4  3.8  2.9  2.4
 5      25    3,223     80,583   1.1%    0.1  0.1  0.1  0.1  0.2  0.5  0.7  2.2  4.0  1.9  1.3  0.8
 6      23    4,732    108,834   1.4%    0.1  0.1  0.1  0.5  1.6  4.0  1.9  1.5  0.8  0.6  0.4  0.3
 7     145   17,467  2,532,653  33.0%    1.3  1.4  1.5  1.4  1.0  0.9  1.1  0.9  0.7  0.6  0.6  0.5
 8      18    3,165     56,968   0.7%    6.4  2.5  0.8  0.6  0.5  0.2  0.2  0.3  0.3  0.1  0.1  0.1
 9       7    5,120     35,839   0.5%    0.2  1.2  2.7  5.4  0.8  0.4  0.6  0.1  0.2  0.2  0.1  0.1
10      21    2,107     44,242   0.6%    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.5  2.3  5.7  3.2
11      17   16,824    286,005   3.7%    0.9  0.7  0.5  0.5  0.5  0.8  1.0  0.7  1.0  1.0  1.8  2.7
12      34    5,565    189,200   2.5%    1.1  3.3  2.7  1.6  0.6  0.4  0.4  0.0  0.2  0.2  0.1  0.1



3.1 Clusters of Seasonal Pattern 

Based on the visualization of the u-matrix in Fig. 2, there are at least two large
clusters (clusters A and B) and two medium-sized clusters (clusters C and D). The
position of the two largest clusters in Fig. 1 shows that these clusters contain
subgroups that have stable or uniform sales patterns throughout the year. Therefore,
the visualization of the u-matrix and of the values of the prototype vectors can give a
rough cluster structure of the original data set.

Fig. 2. The u-matrix

Fig. 3. Clustering result of SOM using k-means with 12 clusters

Fig. 4. The component planes



In order to get a finer clustering result, two-level clustering is used. In this
experiment, we try clusterings with a total number of clusters from 2 to 12. The
Davies-Bouldin index [3] is used to determine the optimal number of clusters. Based
on the Davies-Bouldin index, the best clustering result among 2 to 12 clusters is the
one with 12 clusters. Although any number of clusters can be chosen based on the
purpose of the analysis, this experiment uses the clustering result with 12 clusters,
as shown in Fig. 3.
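
For reference, the Davies-Bouldin index can be computed as in the sketch below, which assumes Euclidean distance and the usual definition: the average, over clusters, of the worst ratio of summed within-cluster scatter to between-centroid distance. The function names and toy partitions are illustrative and are not taken from the experiment, and the code assumes that no two cluster centroids coincide.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition; lower values indicate better separation.

    clusters is a list of at least two non-empty lists of points.
    """
    centers = [centroid(c) for c in clusters]
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        )
    return total / len(clusters)

tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [2.0, 1.0]]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

To select the number of clusters, one would run k-means on the prototype vectors for each k from 2 to 12, evaluate the index on each partition, and keep the k with the smallest value.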

Table 1 reveals some interesting clusters. The most interesting clusters are cluster 7
and cluster 3, since they are the two largest clusters and together contribute almost
69% of total sales quantity. The subgroups in these clusters have roughly stable
seasonal patterns throughout the year, with cluster 3 having slightly higher sales in
the second half of the year and cluster 7 in the first half. This confirms the analysis
based on the visualization of the u-matrix described earlier. Another interesting
cluster is cluster 2, since the average sales quantity of the subgroups in this cluster is
the highest and the sales quantities of these subgroups in July are 4.8 times the
average monthly sales quantity.

These clustering results can be used to determine a marketing strategy for product
combination packages (cross-selling). The common practice in this company is to
combine items that are complementary without taking the seasonality factor into
account, such as combining pens and writing pads. However, selling items with the
same seasonal pattern in one package during their sales peak might not be effective,
since the cost of the promotion (such as the discount value) might be higher than the
increase in sales of each item. We therefore recommend that items from different
clusters (with different seasonal patterns) be sold as a combo-pack, to increase the
sales of one product with the help of the other. This strategy can be exercised
throughout the year with different combinations. For example, the correction pens
subgroup from cluster 6, which has a peak in May, and the refills for gel-based pens
subgroup from cluster 10, which has a peak in November, can be combined in one
package. This combination can increase the sales of correction pens in their off-peak
time, as the sales of refills for gel-based pens are high near the end of the year.

3.2 Component Planes of the Map 

The component planes, shown in Fig. 4, show the seasonality indexes for a particular
month. The colours in the component planes represent the value of the seasonality
index. For example, the sixth component plane shows the values of the seasonality
indexes for June. In June, the subgroups in the top left region have the highest
seasonality indexes, which are over 5, while the subgroups in the top right and bottom
right regions have low seasonality indexes. These component planes also show the
distribution of the seasonal index for each month.

These component planes can be used as a guide for devising strategies for each
month, giving more attention to the subgroups in the red, yellow, and green regions.
Two strategies are proposed based on the analysis of these component planes.

Firstly, the stores can provide a greater variety of items for seasonal subgroups,
which are the subgroups in the red and yellow regions. For example, based on July's
component plane, the sales of the pencil cases subgroup are distributed non-uniformly
throughout the year, with a peak in June. Consequently, the store can provide a
greater variety of pencil cases to its customers during this time. As a result,
customers may view the store as one that provides a wide variety of items.
Additionally, the sales of this subgroup may increase. This strategy would change the
current practice of providing an approximately constant variety of items throughout
the year.

Secondly, the component planes can be used as a guide to arrange store layout and
space allocation for specific purposes, especially for subgroups that lie in the red
region in a particular month. For example, the pencil cases subgroup can be allocated
a larger area and placed at a special place in June. This strategy can be valuable since
the space of the store is very limited. This strategy would also change the current
practice of allocating approximately constant space to each subgroup throughout the
year.



4 Conclusions 

The above discussion has highlighted the use of SOM in the seasonal pattern analysis 
using visualization and clustering. Based on our experiments, SOM is an effective tool 
for seasonal pattern analysis, since it provides visualization that reveals structure in 
data in comprehensible form. Based on visualization and clustering of SOM, several 
strategies based on seasonal pattern are proposed, such as determining product 






D. Lee and V.C.S. Lee 



combination package and managing inventory. It is hoped that sales of these
seasonal items can thereby be increased.

The ideal way to measure seasonality is by measuring demand after considering
other factors that impact sales, such as discounts, inventory, and random effects.
Future work incorporating a multi-attribute measure of seasonality is underway. In
addition, future work includes combining the clustering result of seasonal sales
patterns with the association rules discovered from the data set to determine
promotional packages. Items with different seasonal patterns that are frequently sold
together can be combined as a promotional package.

Acknowledgement. The authors would like to express gratitude to K. Julianto 
Mihardja and Liany Susanna for allowing the use of their data in this experiment. 

Abstract. Many different algorithms for association rules have been studied
in the data mining literature. Some researchers are now focusing on
the application of association rules. In this paper, we study one such
application, called Item Selection for Marketing (ISM), with cross-selling
effect consideration. The ISM problem is to find a subset of items
as marketing items in order to boost the sales of the store. We prove that
a simple version of this problem is NP-hard. We propose an algorithm
to deal with this problem. Experiments are conducted to show that the
algorithm is effective and efficient.



1 Introduction 

In the literature of data mining, there are a lot of studies on association rules
[2]. Such studies are particularly useful when a large amount of data is available
for understanding customer behavior in stores. However, it is generally true
that the results of association rule mining are not directly useful for the business
sector. Therefore there has been research examining the business requirements
more closely and finding solutions that are suitable for particular issues, such
as marketing and inventory control. Recently, some researchers [10] studied the
utility of data mining, such as association and clustering, for decision making in
revenue-maximizing enterprises. They formulated the general problem as
an optimization problem where a profit is to be maximized by determining a best
strategy. The profit is typically generated from customer behaviour in such
an enterprise. More specific problems for revenue maximization are considered in
more recent works [6,14,9,5,4,17]. The related problem of mining user behaviour
has also attracted much research interest recently, and a number of results can be
found in [18].

In this paper we consider the problem of selecting a subset of items in a store
for marketing in order to boost the overall profit. The difficulty of the problem
is that we need to estimate the cross-selling effect to determine the influence of
the marketed items on the sales of the other items. It is known that the records
of sales transactions are very useful [3], and we determine the cross-selling effect
with such information. We call the problem defined this way Item Selection for
Marketing (ISM). We show that a simple version of this problem is NP-hard.
We propose a hill climbing approach to tackle this problem. In our experiment,



H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 431—440, 2004. 
we apply the proposed approach to a set of real data and the approach is found
to be effective and efficient.

2 Related Work 

One major target of data mining is solving decision making problems for the
business sector. The utility of data mining for such problems is investigated
in [10], published in 1998, where a framework based on optimization is presented
for the evaluation of data mining operations. In [10] the general decision making
problem is considered as a maximization problem as follows:

$$\max_{x \in \mathcal{D}} \sum_{i \in C} g(x, y_i) \qquad (1)$$

where $\mathcal{D}$ is the set of all possible decisions in the domain problem (e.g. inventory
control and marketing), $C$ is the set of customers, $y_i$ is the data we have on
customer $i$, and $g(x, y_i)$ is the utility (benefit) from a decision $x$ and $y_i$. However,
when we examine some such decision problems more closely, we find that we are
actually dealing with a maximization problem of the form

$$\max_{x \in \mathcal{D}} g(x, Y) \qquad (2)$$

where $Y$ is the set of all $y_i$, or the set of data collected about all customers. The
above is more appropriate when there are correlations among the behaviours of
customers (e.g. cross-selling, where the purchase of one item is related to the
purchase of another item), or when there are interactions among the customers
themselves (e.g. viral marketing, or marketing by word-of-mouth among customers).
This is because we cannot determine $g()$ based on each single customer alone.

We illustrate the above with two different problems that have been studied.
The first problem is optimal product selection [5,4,16,17] (in SIGKDD
1999, 2000, 2002, and ICDM 2003, respectively). The problem is that in a typical
retail store, the types of products should be refreshed regularly so that losing
products are discarded and new products are introduced. Hence we are interested
in finding a subset of the products to be discontinued so that the profit can
be maximized. The formulation of the problem considers the important factor of
cross-selling, which is the influence of some products on the sales of other
products. The cross-selling factor is embedded into the calculation of the maximum
profit gain from a decision. This factor can be obtained from an analysis of the
history of transactions kept from previous sales, which corresponds to the set Y
in formulation (2).^1

The second such problem is viral marketing, where we need to choose
a subset of the customers to be the targets of marketing so that they can
influence more of the other customers. Some related work can be found in [6,14,9,1]
(in SIGKDD 2001, 2002, 2003 and WWW 2003, respectively). Again the profit gain
from any decision relies on an analysis based on the knowledge collected about
all customers.

^1 The problem is related to inventory management, which has been studied in
management science; however, previous works are mostly on the problems of when to
order, where to order from, how much to order and the proper logistics [15].

The problem that we tackle here is of a similar nature since we also consider 
the factor of cross-selling when calculating the utility or benefit of a decision. 
In our modeling, we adopt concepts of the association rules to model the cross- 
selling effects among items. 

Suppose we are given a set $I$ of items, and a set of transactions. Each
transaction is a subset of $I$. An association rule has the form $X \to I_j$,
where $X \subseteq I$ and $I_j \in I - X$; the support of such a rule is the
fraction of transactions containing all items in $X$ and item $I_j$; the
confidence for the rule is the fraction of the transactions containing all
items in set $X$ that also contain item $I_j$. The problem is to find all rules
with sufficient support and confidence given some thresholds. Earlier work
includes [13,2,12].
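The definitions of support and confidence above can be illustrated with a minimal sketch. The function names and the toy transactions are assumptions made for illustration only; they do not come from the paper.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, x, item):
    """Confidence of the rule X -> item: among the transactions that
    contain all of X, the fraction that also contain `item`."""
    x = set(x)
    with_x = [t for t in transactions if x <= set(t)]
    if not with_x:
        return Fraction(0)
    return Fraction(sum(1 for t in with_x if item in t), len(with_x))

# Hypothetical toy data (not from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Support of the rule {bread} -> milk is the support of {bread, milk}.
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, "milk")
print(rule_support, rule_confidence)
```

Exact fractions are used here only to keep the toy numbers easy to check by hand.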



3 Problem Definition 

In this section we introduce the problem of ISM. To the best of our knowledge,
this is the first definition of the item selection problem for marketing with
the consideration of cross-selling effects. Item Selection for Marketing (ISM)
is the problem of selecting a set of items for marketing, called marketing
items, so as to maximize the total profit of marketing items and non-marketing
items among all choices. In ISM, we assume that the sales of some items are
affected by the sales of some other items. We are given a data set with $m$
transactions, $t_1, t_2, \ldots, t_m$, and $n$ items, $I_1, I_2, \ldots, I_n$.
Let $I = \{I_1, I_2, \ldots, I_n\}$. The profit of item $I_a$ in transaction
$t_i$ before marketing is given by $prof(I_a, t_i)$. Let $S \subseteq I$ be a
set of selected items. In each transaction $t_i$, we define two symbols, $t'_i$
and $d_i$, for the calculation of the total profit:

$$t'_i = t_i \cap S, \qquad d_i = t_i - t'_i$$

Definition 1 (Profit Before Marketing). The original profit $Profit_0$ before
marketing for all transactions is defined as:

$$Profit_0 = \sum_{i=1}^{m} \sum_{I_a \in t_i} prof(I_a, t_i) \qquad (3)$$



Suppose we select a subset $S$ of marketing items. Marketing actions such as
discounting will be taken on $S$. Let us consider a transaction $t_i$
containing the marketing items $I_a$ and non-marketing items $I_b$. If we
market item $I_a$ with a cost of $cost(I_a, t_i)$ (e.g. a discount on the
item), the profit of item $I_a$ after marketing in transaction $t_i$ will
become $prof(I_a, t_i) - cost(I_a, t_i)$. After the marketing actions are
taken, more of the marketing items, say $I_a$, will be purchased. We define
the change in the sales by $\alpha(T)$, where $T$ is a set of items:

$$\alpha(T) = \frac{\text{sale volume of } T \text{ after marketing}}{\text{sale volume of } T \text{ before marketing}} \qquad (4)$$

In the above, the sale volume of $T$ is measured by the total amount of the
items in $T$ that are sold in a fixed period of time.^2 If
$\alpha(\{I_a\}) = 1$, then there is no increase in the sales of item $I_a$.
If $\alpha(\{I_a\}) = 2$, then the sales of $I_a$ are doubled compared with
the sales before marketing.

On the other hand, without the consideration of the cross-selling effect due
to marketing, the profit of a non-marketing item $I_b$ is still
$prof(I_b, t_i)$. With the consideration of cross-selling effects, some of the
non-marketing items $I_b$ will be purchased more if there is an increase in
the sales of marketing items $I_a$. The cross-selling factor is modelled by
$csfactor(T, I_b)$, where $T$ is a set of marketing items $I_a$, and
$0 \le csfactor(T, I_b) \le 1$. That is, more customers may come to buy item
$I_b$ if some other items in $T$ are being marketed. The increase in the sales
of item $I_b$ is modelled by $(\alpha(T) - 1)\, csfactor(T, I_b)$.^3 If
$\alpha(T) = 1$, then there is no increase in the sales of marketing items in
set $T$, so there is no increase in the sales of non-marketing item $I_b$ and
the term $(\alpha(T) - 1)\, csfactor(T, I_b)$ becomes zero. Similarly, if
$\alpha(T) = 2$, the sales of items in set $T$ are doubled, and the increase
in sales is modelled by $csfactor(T, I_b)$.

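Definition 2 (Profit After Marketing). The profit after marketing $Profit_1$
is defined as follows.

$$Profit_1 = \sum_{i=1}^{m} \Big[ \sum_{I_a \in t'_i} \alpha(\{I_a\}) \big( prof(I_a, t_i) - cost(I_a, t_i) \big) + \sum_{I_b \in d_i} \big( 1 + (\alpha(t'_i) - 1)\, csfactor(t'_i, I_b) \big)\, prof(I_b, t_i) \Big] \qquad (5)$$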



Recall that $t'_i$ is the set of items in transaction $t_i$ that are selected
to be marketed. For each transaction $t_i$, we compute the profit from the
marketing items (reduced by $cost(I_a, t_i)$), and the profit from the
non-marketing items whose sales are influenced by $csfactor()$. $Profit_1$ is
the sum of the profits from all transactions. The objective of marketing is to
increase the profit gain compared with the profit before marketing. The profit
gain is defined as follows.

Definition 3 (Profit Gain). The profit gain is:

$$\text{Profit Gain} = Profit_1 - Profit_0 \qquad (6)$$



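From the above definitions, we can rewrite the profit gain as follows.

$$\text{Profit Gain} = Profit_1 - Profit_0 = \sum_{i=1}^{m} \Big[ \sum_{I_a \in t'_i} \big[ (\alpha(\{I_a\}) - 1)\, prof(I_a, t_i) - \alpha(\{I_a\})\, cost(I_a, t_i) \big] + \sum_{I_b \in d_i} (\alpha(t'_i) - 1)\, csfactor(t'_i, I_b)\, prof(I_b, t_i) \Big] \qquad (7)$$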
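As a sanity check on Definitions 1 to 3 and Equation (7), the following sketch computes the profit before and after marketing and the rewritten gain on a toy example. It assumes a uniform ratio alpha for all item sets and takes the cross-selling factor as a caller-supplied function; the data layout (a dict from item to a (prof, cost) pair per transaction), the constant factor 0.5 and all names are illustrative assumptions, not the paper's implementation.

```python
import math

def profits(transactions, selected, alpha, csfactor):
    """Return (Profit_0, Profit_1) following Equations (3) and (5)."""
    before = after = 0.0
    for t in transactions:
        marketed = frozenset(selected & set(t))          # t'_i
        for item, (prof, cost) in t.items():
            before += prof
            if item in marketed:
                after += alpha * (prof - cost)
            else:                                        # item in d_i
                # Assumption: the factor is 0 when nothing is marketed.
                cs = csfactor(marketed, item) if marketed else 0.0
                after += (1 + (alpha - 1) * cs) * prof
    return before, after

def profit_gain(transactions, selected, alpha, csfactor):
    """Profit gain computed directly from Equation (7)."""
    gain = 0.0
    for t in transactions:
        marketed = frozenset(selected & set(t))
        for item, (prof, cost) in t.items():
            if item in marketed:
                gain += (alpha - 1) * prof - alpha * cost
            elif marketed:
                gain += (alpha - 1) * csfactor(marketed, item) * prof
    return gain

# Hypothetical data: each transaction maps item -> (prof, cost).
transactions = [
    {"A": (10.0, 2.0), "B": (6.0, 1.0)},
    {"B": (6.0, 1.0), "C": (4.0, 1.0)},
]
constant_factor = lambda marketed, item: 0.5   # assumed constant csfactor
before, after = profits(transactions, {"A"}, 2.0, constant_factor)
gain = profit_gain(transactions, {"A"}, 2.0, constant_factor)
print(before, after, gain)
```

With item A marketed, the first transaction contributes 2 * (10 - 2) for A and 1.5 * 6 for B, while the second transaction is unchanged, so the gain from Equation (7) matches the difference of the two profits.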

^2 We note that different items may have different increase ratios of the
sales (i.e. $\alpha(\{I_i\})$). However, it is difficult to predict this
parameter $\alpha(\{I_i\})$ for each item $I_i$. For simplicity, we set all
$\alpha(\{I_i\})$ to be the same (e.g. $\alpha_0$) in this paper, which is the
same as [6,14].

^3 If $\alpha(\{I_i\}) = \alpha_0$ for all $i$, then it is easy to see that
$\alpha(T) = \alpha_0$ for any $T$ (a subset of $I$).



ISM: Item Selection for Marketing with Cross-Selling Considerations



Next we can formally define the problem of ISM:

ISM: Given a set of transactions with profits assigned to each item in each
transaction and the cross-selling factors, $csfactor()$, pick a set $S$ from
all given items which gives a maximum profit gain.

This problem is at least as difficult as the following decision problem.

ISM Decision Problem: Given a set of items and a set of transactions with
profits assigned to each item in each transaction, a minimum profit gain $G$,
and cross-selling factors, $csfactor()$, can we pick a set $S$ such that
Profit Gain $\ge G$?

Note that the cross-selling factor can be determined in different ways; one
way is by the domain experts. Let us consider the very simple version where
$csfactor(t'_i, I_a) = 1$ for any non-empty set $t'_i$. That is, any selected
items in the transaction will increase the sale of the other items with the
same volume. This may be a much simplified version of the problem, but it is
still very difficult.

Theorem 1 (NP-hardness). The item selection for marketing (ISM) decision
problem where $csfactor(t'_i, I_a) = 1$ for $t'_i \ne \emptyset$ and
$csfactor(t'_i, I_a) = 0$ for $t'_i = \emptyset$ is NP-hard.

Proof: We shall transform the problem of MAX CUT to the ISM problem.
MAX CUT [7] is an NP-complete problem defined as follows: Given a graph
$(V, E)$ with weight $w(e) = 1$ for each $e \in E$ and a positive integer $K$,
is there a partition of $V$ into disjoint sets $V_1$ and $V_2$ such that the
sum of the weights of the edges from $E$ that have one endpoint in $V_1$ and
one endpoint in $V_2$ is at least $K$? The transformation from MAX CUT to the
ISM problem is as follows. Let $G = K$, $\alpha(\{I_a\}) = 2$, and
$\alpha(t'_i) = 2$. For each vertex $v \in V$, construct an item. For each edge
$e \in E$, where $e = (v_1, v_2)$, create a transaction with 2 items
$\{v_1, v_2\}$. Set $prof(I_j, t_i) = 1$ and $cost(I_j, t_i) = 0.5$, where
$t_i$ is a transaction created above, $i = 1, 2, \ldots, |E|$, and $I_j$ is an
item in $t_i$. Each marketing item then contributes
$(2 - 1) \cdot 1 - 2 \cdot 0.5 = 0$ to Equation (7), so it is easy to check
that Profit Gain $= \sum_{i=1}^{|E|} \sum_{I_j \in d_i} csfactor(t'_i, I_j)$,
which is the number of transactions (edges) with exactly one selected item.
The above transformation can be constructed in polynomial time. When the
problem is solved in the transformed ISM, the original MAX CUT problem is also
solved. Since MAX CUT is an NP-complete problem, the ISM problem is NP-hard. □
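The reduction in the proof can be checked by brute force on a small graph. The sketch below is only an illustration under the proof's settings (alpha = 2, prof = 1, cost = 0.5, csfactor equal to 1 when the transaction has a selected item and 0 otherwise); the graph and the function names are hypothetical.

```python
from itertools import combinations

def cut_size(edges, side):
    """Number of edges with exactly one endpoint in `side`."""
    return sum(1 for u, v in edges if (u in side) != (v in side))

def ism_gain(edges, selected, alpha=2.0, prof=1.0, cost=0.5):
    """Profit gain of Equation (7) on the transformed instance:
    one two-item transaction {u, v} per edge."""
    gain = 0.0
    for u, v in edges:
        marketed = [x for x in (u, v) if x in selected]      # t'_i
        others = [x for x in (u, v) if x not in selected]    # d_i
        gain += len(marketed) * ((alpha - 1) * prof - alpha * cost)
        cs = 1.0 if marketed else 0.0
        gain += len(others) * (alpha - 1) * cs * prof
    return gain

# Hypothetical graph: a triangle 1-2-3 with a pendant vertex 4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
vertices = [1, 2, 3, 4]
subsets = [set(c) for r in range(len(vertices) + 1)
           for c in combinations(vertices, r)]
gains_match_cuts = all(ism_gain(edges, s) == cut_size(edges, s) for s in subsets)
best_gain = max(ism_gain(edges, s) for s in subsets)
print(gains_match_cuts, best_gain)
```

On this graph every selection has a profit gain equal to the size of the cut it induces, so maximizing the gain is the same as finding a maximum cut.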



4 Association Based Cross-Selling Effect 

In the previous section, we saw that the cross-selling factor is important in
the problem formulation. The factor is denoted by $csfactor(t'_i, I_j)$, where
$t'_i$ is a set of items selected for marketing and $I_j$ is another item. This
factor can be provided by domain experts if they can estimate the impact of
$t'_i$ on $I_j$. However, in a typical application, the number of items would
be large and it would be impractical to expect purely human analysis of these
values. We suggest that the factor be determined by data mining techniques
based on the history of transactions collected for the application. We shall
adopt the concepts of association rules for this purpose.

Definition 4. Let $d_i = \{Y_1, Y_2, Y_3, \ldots, Y_q\}$ where $Y_i$ refers to
a single item for $i = 1, 2, \ldots, q$; then
$\diamond d_i = Y_1 \vee Y_2 \vee Y_3 \vee \ldots \vee Y_q$. □

In our remaining discussion, $csfactor(t'_i, I_j)$ is equal to
$conf(\diamond t'_i \to I_j)$, where $conf(\diamond t'_i \to I_j)$ is the
confidence of the rule $\diamond t'_i \to I_j$. The definition of confidence
here is similar to that for association rules. That is,

$$csfactor(t'_i, I_j) = conf(\diamond t'_i \to I_j) = \frac{\text{number of transactions containing any item in } t'_i \text{ and } I_j}{\text{number of transactions containing any item in } t'_i}$$
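The confidence of the disjunctive rule can be computed directly from the transaction history. The following is a minimal sketch of that ratio; the function name, the use of exact fractions and the toy data are assumptions for illustration.

```python
from fractions import Fraction

def csfactor(transactions, marketed, item):
    """Confidence of the rule (any item of `marketed`) -> `item`."""
    marketed = set(marketed)
    with_any = [t for t in transactions if marketed & set(t)]
    if not with_any:
        # Assumption: no supporting transaction means no cross-selling.
        return Fraction(0)
    return Fraction(sum(1 for t in with_any if item in t), len(with_any))

# Hypothetical transactions (not from the paper).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"B"}, {"C"}]
factor = csfactor(transactions, {"A", "B"}, "C")
print(factor)
```

Three transactions contain A or B, and two of those also contain C, so the factor is 2/3 in this toy case.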

The reason for the above formulation is as follows. A transaction can be
viewed as a customer behavior. In transaction $t_i$, there is a cross-selling
effect between any marketing items $I_a$ in $t'_i$ and the non-marketing items
in set $d_i$. Let us consider some cases. If all items in $t_i$ are being
marketed, then there are no non-marketing items, and the profit gain is the
difference between the profit of marketing items after marketing and that
before marketing. If no items in $t_i$ are marketed, there is no cross-selling
effect from marketing items in transaction $t_i$, and the profit gain due to
marketing becomes zero. Now, consider the case of a transaction containing
both marketing items and non-marketing items. Suppose the customer who
purchases any marketing items in set $t'_i$ always purchases non-marketing
items $I_b$. This phenomenon is modelled by a gain rule
$\diamond t'_i \to I_b$. The greater the confidence of these rules is, the
greater the cross-selling effect is. That is, if this confidence is high, then
when more of $t'_i$ are sold, it is very likely that more of $I_b$ will also
be sold.^4

5 Hill Climbing Approach 

The ISM problem is likely to be very difficult. We propose here a hill climbing
approach to tackle the problem.^5

Let $f(S)$ be the function of the profit gain of the selection $S$ of
marketing items. Initially, we assign $S = \{\}$. Then, we calculate
$f(S \cup \{I_a\})$ for each item $I_a$. We choose the item $I_b$ with the
greatest value of $f(S \cup \{I_b\})$ and insert it into set $S$. The above
process repeats for the remaining items whenever
$f(S \cup \{I_b\}) > f(S)$.
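The greedy loop described above can be sketched as follows. This is a naive illustration that recomputes the full profit gain for every candidate, assuming a uniform alpha and the confidence-based cross-selling factor of Section 4; the data layout (item mapped to a (prof, cost) pair) and all names are assumptions, not the authors' implementation.

```python
def csfactor(transactions, marketed, item):
    """Confidence that a transaction with any marketed item also has `item`."""
    with_any = [t for t in transactions if marketed & set(t)]
    if not with_any:
        return 0.0
    return sum(1 for t in with_any if item in t) / len(with_any)

def profit_gain(transactions, selected, alpha):
    """f(S): profit gain of Equation (7) for the selection `selected`."""
    gain = 0.0
    for t in transactions:
        marketed = selected & set(t)
        for item, (prof, cost) in t.items():
            if item in marketed:
                gain += (alpha - 1) * prof - alpha * cost
            elif marketed:
                gain += (alpha - 1) * csfactor(transactions, marketed, item) * prof
    return gain

def hill_climb(transactions, alpha):
    """Add the best item while it strictly improves the profit gain."""
    items = set().union(*(set(t) for t in transactions))
    selected, best = set(), 0.0            # f({}) = 0
    while items - selected:
        gains = {i: profit_gain(transactions, selected | {i}, alpha)
                 for i in sorted(items - selected)}
        item = max(gains, key=gains.get)
        if gains[item] <= best:
            break
        selected.add(item)
        best = gains[item]
    return selected, best

# Hypothetical data: item -> (prof, cost) per transaction.
transactions = [
    {"A": (1.0, 0.25), "B": (10.0, 5.0)},
    {"A": (1.0, 0.25), "B": (10.0, 5.0)},
    {"C": (2.0, 1.0)},
]
selected, gain = hill_climb(transactions, alpha=2.0)
print(selected, gain)
```

In this toy case, marketing the cheap item A pays off because it always sells together with the profitable item B, whereas marketing B or C directly does not improve the gain.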

^4 The rule $I_a \to \diamond d_i$ is called a loss rule in [17], because in
[17] the problem is to determine a set of items to be discontinued from a
store; $d_i$ refers to some items to be removed, and this may cause some loss
in profit from other items.

^5 We have also tried to apply the well-known optimization technique of
quadratic programming. However, we could only approximate the problem by a
quadratic programming problem, and the approximation is not very accurate since
we need to throw away terms in a Taylor series which may not be insignificant.
The resulting performance is not as good as that of the hill climbing method
and hence is not shown.

5.1 Efficient Calculation of the Profit Gain

As the formula of the profit gain is computationally intensive, an efficient
calculation of this formula is required. The hill climbing approach chooses the
item with the greatest profit gain in each iteration. Suppose $S$ contains $k$
items at the $k$-th iteration. At this iteration, we store the value of $f(S)$
in a variable $f_S$. At the $(k+1)$-th iteration, we can calculate
$f(S \cup \{I_x\})$ from $f_S$ efficiently for all $I_x \notin S$. Let $T$ be
the set of transactions containing item $I_x$ and at least one item in
selection set $S$. We can calculate $f(S \cup \{I_x\})$ as

$$f(S \cup \{I_x\}) = f_S + g(I_x) - h(S, T) + h(S \cup \{I_x\}, T) \qquad (9)$$

where

$$g(I_x) = \sum_{t_i \ni I_x} \big[ (\alpha(\{I_x\}) - 1)\, prof(I_x, t_i) - \alpha(\{I_x\})\, cost(I_x, t_i) \big]$$

$$h(X, T) = \sum_{t_i \in T} \sum_{I_b \in d_i} (\alpha(t'_i) - 1)\, csfactor(t'_i, I_b)\, prof(I_b, t_i)$$

For $h(X, T)$ we assume all items in set $X$ are selected for marketing, i.e.
$t'_i = t_i \cap X$ and $d_i = t_i - t'_i$. Function $g(I_x)$ is the profit
gain of marketing item $I_x$ in all transactions. Function $h(X, T)$ is the
profit gain of non-marketing items for the selection $X$ in all transactions
in set $T$.

Let us consider the calculation of $f(S \cup \{I_x\})$. First, we need to add
the profit gain of the newly added marketing item $I_x$ after marketing (i.e.
$g(I_x)$) to $f_S$. For the remaining parts, we only deal with the transactions
in set $T$. We need to subtract the profit gain of non-marketing items for the
selection $S$ in all the transactions in set $T$ (i.e. $h(S, T)$) and then add
the profit gain of non-marketing items for the new selection
$S \cup \{I_x\}$ in all the transactions in set $T$ (i.e.
$h(S \cup \{I_x\}, T)$). As the set $T$ is typically small compared with the
whole database, we can save much computation by restricting the scope of
search to $T$. In the actual implementation the scope restriction is realized
by a special search procedure of a special FP-tree as described below.
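The incremental update of Equation (9) can be compared against a full recomputation with a small sketch. One assumption is made loudly here: the sketch takes T to be every transaction containing $I_x$, which is a superset of the T described in the text, so that transactions in which $I_x$ becomes the first marketing item are also refreshed. The data layout, the uniform alpha and all names are hypothetical.

```python
import math

def csfactor(transactions, marketed, item):
    """Confidence that a transaction with any marketed item also has `item`."""
    with_any = [t for t in transactions if marketed & set(t)]
    if not with_any:
        return 0.0
    return sum(1 for t in with_any if item in t) / len(with_any)

def g(transactions, x, alpha):
    """Gain of marketing x itself, over all transactions containing x."""
    return sum((alpha - 1) * t[x][0] - alpha * t[x][1]
               for t in transactions if x in t)

def h(transactions, scope, selected, alpha):
    """Cross-selling gain of non-marketing items over transactions in `scope`."""
    total = 0.0
    for t in scope:
        marketed = selected & set(t)                 # t'_i
        if not marketed:
            continue
        for item in set(t) - marketed:               # d_i
            total += (alpha - 1) * csfactor(transactions, marketed, item) * t[item][0]
    return total

def f(transactions, selected, alpha):
    """Full profit gain of Equation (7)."""
    return (sum(g(transactions, x, alpha) for x in selected)
            + h(transactions, transactions, selected, alpha))

def f_incremental(transactions, f_s, selected, x, alpha):
    """Equation (9), with T assumed to be all transactions containing x."""
    touched = [t for t in transactions if x in t]
    return (f_s + g(transactions, x, alpha)
            - h(transactions, touched, selected, alpha)
            + h(transactions, touched, selected | {x}, alpha))

# Hypothetical data: item -> (prof, cost) per transaction.
transactions = [
    {"A": (4.0, 1.0), "B": (6.0, 2.0)},
    {"A": (4.0, 1.0), "C": (3.0, 1.0)},
    {"B": (6.0, 2.0), "C": (3.0, 1.0)},
    {"C": (3.0, 1.0)},
]
f_s = f(transactions, {"A"}, 2.0)
full = f(transactions, {"A", "B"}, 2.0)
incremental = f_incremental(transactions, f_s, {"A"}, "B", 2.0)
print(f_s, full, incremental)
```

Only the transactions containing the new item are revisited; the cross-selling terms of all other transactions are unchanged because their marketed subsets do not change.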

5.2 FP-Tree Implementation 

The transactions in the database are examined whenever the confidence term
$conf(\diamond t'_i \to I_j)$ is calculated, so this operation must be done
efficiently. If we actually scan the given database, which typically contains
one record for each transaction, the computation will be very costly. Here we
make use of the FP-tree structure [8].

We construct an FP-tree once for all transactions, setting the support
threshold to zero, and recording the occurrence count of itemsets at each tree
node. With the zero threshold, the FP-tree retains all information in the given
set of transactions. Then we can traverse the FP-tree instead of scanning the
original database. The advantage of the FP-tree is that it forms a single path
for transactions with repeated patterns. In many applications, there exist many
transactions with the same pattern, especially when the number of transactions
is large. These repeated patterns are processed only once with the FP-tree. By
traversing the FP-tree once, we can count the number of transactions containing
any items in set $t'_i$ and item $I_j$, and the number of transactions
containing any items in set $t'_i$.
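To illustrate why a tree with counts avoids repeated database scans, here is a simplified sketch. It is a plain prefix tree with items in lexicographic order, not a full FP-tree (no frequency ordering and no header table), and it is not the parseFPTree procedure of [17]; the class and function names are assumptions.

```python
class Node:
    """Prefix-tree node holding an item and the number of transactions
    whose sorted item list passes through this node."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions):
    root = Node()
    for t in transactions:
        node = root
        for item in sorted(t):          # fixed item order, zero threshold
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def count_any(root, marketed, item=None):
    """Count transactions containing at least one item of `marketed`
    (and also `item`, when it is given) in one traversal."""
    total = 0
    stack = [(child, frozenset()) for child in root.children.values()]
    while stack:
        node, path = stack.pop()
        path = path | {node.item}
        # Transactions that end exactly at this node.
        ending = node.count - sum(c.count for c in node.children.values())
        if ending and path & marketed and (item is None or item in path):
            total += ending
        stack.extend((c, path) for c in node.children.values())
    return total

# Hypothetical transactions; identical ones share a single path.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "C"}, {"B"}, {"C"}]
tree = build_tree(transactions)
denominator = count_any(tree, {"A", "B"})
numerator = count_any(tree, {"A", "B"}, "C")
print(numerator, denominator)
```

The two counts are the numerator and denominator of the confidence used as the cross-selling factor, and the duplicated transaction {A, C} is visited once with a count of 2.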

The details of the procedure can be found in the description of the function
parseFPTree(N,D) in [17]. In our experiments this mechanism greatly reduces
the overall running time.

6 Empirical Study

We used a Pentium IV 2.2 GHz PC to conduct our experiments. In our
experiments, we study the resulting profit gain of marketing using the proposed
algorithm. After the execution of our algorithm, there will be a number of
selected marketing items. Let there be $J$ resulting items (or marketing
items). Note that $J$ is not an input parameter. We compare our results with
the naive approach of marketing by choosing the $J$ items with the greatest
values of profit gain in Definition 3 in Section 3 as marketing items,
assuming no cross-selling effect (i.e. $csfactor(t'_i, I_b) = 0$ for any set
$t'_i$ and item $I_b$). This naive approach is called direct marketing.

6.1 Data Set 

We adopted the data set from BMS WebView-1, which contains clickstream and 
purchase data collected by a web company and is part of the KDD-Cup 2000 
data [11]. There are 59,602 transactions and 497 items. The average transaction 
size is 2.5. The profit of each item is generated similarly as [17]. 

6.2 Experimental Results 

For the data set, we study two types of marketing method: discounted items and
free items. For discounted items, the selling price is half of the original
price. Free items are free of charge. As remarked in Section 3, we shall assume
a uniform change in the sale volume for all marketing items, i.e.
$\alpha(\{I_i\}) = \alpha$ for all items $I_i$. This setup is similar to that
in [6]. For the real data set, the experimental results of profit gains and
execution time against $\alpha$ for the situation of discounted items are shown
in Figure 1 and Figure 2. Those for the situation of free items are shown in
Figure 3 and Figure 4. In the graphs showing the profit gains, we show the
number of resulting marketing items next to each data point of the hill
climbing method. This number is also the number of iterations in the hill
climbing method.

In all the experiments, the profit gain for the hill climbing approach is always 
greater than that for direct marketing. This is because the proposed algorithm 
considers the cross-selling effect among items, but the direct marketing does not. 

The execution time of direct marketing is roughly constant and is very small
in all cases. For the hill climbing approach, the execution time increases
significantly with the increase in $\alpha$. This is explained by the fact that
when $\alpha$ is increased, the marketing effect increases, meaning that the
increase in sale of marketing items will be greater, which also increases the
sale of non-marketing items by the cross-selling effect. The combined increase
in sale makes more items profitable for marketing since they can now counter
the cost of marketing. This means that the hill climbing approach will have
more iterations as $\alpha$ increases, since the introduction of each marketing
item requires one iteration, and this means longer execution time.

Note that in the scenario of free marketing items, direct marketing leads to
zero or negative profit gain. This is because the items are free and generate no
profit, and hence when compared to the profit before marketing, the profit gain
is zero or negative. In the synthetic data, it is found that the gain is zero in
most cases, since the marketing items are chosen to be those with no recorded
transaction. The gain becomes negative for the real data set. Such results are
similar to those for direct marketing in [6].





Fig. 1. Profit Gains against α for Real Data Set (Discount)

Fig. 2. Execution Time against α for Real Data Set (Discount)

Fig. 3. Profit Gains against α for Real Data Set (Free)

Fig. 4. Execution Time against α for Real Data Set (Free)



7 Conclusion 

In this paper, we have formulated the problem of Item Selection for Marketing
(ISM) with the consideration of the cross-selling effect among the items. We
proved that a simple version of this problem is NP-hard. We adopt the concepts
of association rules to determine the cross-selling factor, and we propose a
hill climbing approach to deal with this problem. We have conducted
experiments on both real data and synthetic data to compare our method with a
naive marketing method. The results show that our algorithm is highly
effective and efficient.



Abstract. Mining frequent tree patterns is an important research
problem with broad applications in bioinformatics, digital libraries,
e-commerce, and so on. Previous studies strongly suggest that
pattern-growth methods are efficient in frequent pattern mining. In this
paper, we systematically develop pattern-growth methods for mining frequent
tree patterns. Two algorithms, Chopper and XSpanner, are devised. An
extensive performance study shows that the two newly developed algorithms
outperform TreeMinerV [13], one of the fastest methods proposed
before, in mining large databases. Furthermore, algorithm XSpanner is
substantially faster than Chopper in many cases.



1 Introduction 

Recently, many emerging application domains encounter tremendous demands 
and challenges of discovering knowledge from complex and semi-structural data. 
For example, one important application is mining semi-structured data [2,4,7,11, 
12,13]. In [12], Wang and Liu adopted an Apriori-based technique to mine fre- 
quent path sets in ordered trees. In [7], Miyahara et al. used a directly generate- 
and-test method to mine tree patterns. Recently, Zaki [13] and Asai et al. [2] 
proposed more efficient algorithms for frequent subtree discovery in a forest, 
respectively. They adopted the method of rightmost expansion, that is, their 
methods add nodes only to the rightmost branch of the tree. 

Recently, some interesting approaches for frequent tree pattern mining have
been proposed. Two typical examples are reported in [6,13]. These methods
observe the Apriori property among the frequent tree sub-patterns: every
non-empty subtree of a frequent tree pattern is also frequent. Thus, they
smartly extend the candidate-generation-and-test approach to tackle the mining.
The method for frequent tree pattern mining is efficient and scalable when the
patterns are not too complex. Nevertheless, if there are many complex patterns
in the data set, a huge number of candidates may need to be generated and
tested, which may degrade the performance dramatically.

* This research is supported in part by the Key Program of National Natural Science
Foundation of China (No. 69933010), China National 863 High-Tech Projects (No.
2002AA4Z3430 and 2002AA231041), and US NSF grant IIS-0308001.

H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 441-451, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004

Some previous studies strongly indicate that the depth-first search based, 
pattern-growth methods, such as FP-growth [5], TreeProjection [1] and H-mine 
[8] for frequent itemset mining, and PrefixSpan [9] for sequential pattern mining, 
can mine long patterns efficiently from large databases. That stimulates our 
thinking: “Can we extend the pattern-growth methods for efficient frequent tree
mining?” This is the motivation of our study.

Is it straightforward to extend the pattern-growth methods to mine tree pat- 
terns? Unfortunately, the previously developed pattern-growth methods cannot 
be extended simply to tackle the frequent tree pattern mining problem efficiently. 
There are two major obstacles. On the one hand, one major cost in frequent tree 
pattern mining is to test whether a pattern is a subtree of an instance in the 
database. New techniques must be developed to make the test efficient. On the 
other hand, there can be many possible “growing points” (i.e., possible ways 
to extend an existing pattern to more complex ones) in a tree pattern. It is 
non-trivial to determine the “good” growth strategy and avoid redundance. 

In this paper, we systematically study the problem of frequent tree pattern
mining and develop two novel and efficient algorithms, Chopper and XSpanner,
to tackle the problem. In algorithm Chopper, the mining of sequential patterns
and the extraction of frequent tree patterns are separated as two phases. For
each sequential pattern, Chopper generates and tests all possible tree patterns
against the database. In algorithm XSpanner, the mining of sequential patterns
and the extraction of frequent tree patterns are integrated. Larger frequent tree
patterns are “grown” from smaller ones.

Based on the above ideas, we develop effective optimizations to achieve ef- 
ficient algorithms. We compare both Chopper and XSpanner with algorithm 
TreeMinerV [13], one of the best algorithms proposed previously, by an exten- 
sive performance study. As an Apriori-based algorithm, TreeMinerV achieves the 
best performance for mining frequent tree patterns among all published meth- 
ods. Our experimental results show that both Chopper and XSpanner outperform 
TreeMinerV while the mining results are the same. XSpanner is more efficient 
and more scalable than Chopper. 

The remainder of this paper is organized as follows. In Section 2, we define the
problem of frequent tree pattern mining. Algorithms Chopper and XSpanner are
developed in Sections 3 and 4, respectively. In Section 5, we present the
results on synthetic and real datasets in comparison with TreeMinerV. Section 6
concludes the paper.



2 Problem Definition 

A tree is an acyclic connected graph. In this paper, we focus on ordered,
labelled, rooted trees. A tree is denoted as $T(v_0, N, L, E)$, where (1)
$v_0 \in N$ is the root node; (2) $N$ is the set of nodes; (3) $L$ is the set
of labels of nodes, where for any node $u \in N$, $L(u)$ is the label of $u$;
and (4) $E$ is the set of edges in the tree. Please note that two nodes in a
tree may carry the identical label.

Fig. 1. Tree and embedded subtree.

Fig. 2. An example of tree database TDB.

Let $v$ be a node in a tree $T(v_0, N, L, E)$. The level of $v$ is defined as
the length of the shortest path from $v_0$ to $v$. The height of the tree is
the maximum level of all nodes in the tree. For any nodes $u$ and $v$ in tree
$T(v_0)$, if there exists a path $v_0$-...-$u$-...-$v$ such that every edge in
the path is distinct, then $u$ is called an ancestor of $v$ and $v$ a
descendant of $u$. In particular, if $(u, v)$ is an edge in the tree, then $u$
is the parent of $v$ and $v$ is a child of $u$. For nodes $u$, $v_1$ and $v_2$,
if $u$ is the parent of both $v_1$ and $v_2$, then $v_1$ and $v_2$ are
siblings. A node without any child is a leaf node; otherwise, it is an internal
node. In general, an internal node may have multiple children. If for each
internal node, all the children are ordered, then the tree is an ordered tree.
We denote the $k$-th child of node $u$ as $child_k(u)$. In the case that such a
child does not exist, $child_k(u) = null$.

Hereafter, without special mention, all trees are labelled, ordered, and rooted. 
An example is shown in Figure 1(a). The tree is of height 4. 

Given a tree $T(v_0, N, L, E)$, tree $T'(v'_0, N', L', E')$ is called an
embedded subtree of $T$, denoted as $T' \sqsubseteq T$, if (1)
$N' \subseteq N$; (2) for any node $u \in N'$, $L(u) = L'(u)$; and (3) for
every edge $(u, v) \in E'$ such that $u$ is the parent of $v$, $u$ is an
ancestor of $v$ in $T$. Please note that the concept of embedded subtree
defined here is different from the conventional one.^1 In Figure 1(b), an
embedded subtree of tree $T$ in Figure 1(a) is shown.

A tree database is a bag of trees. Given a tree database TDB, the support of
a tree $T$ is the number of trees in TDB such that $T$ is an embedded subtree,
i.e., $sup(T) = \|\{T' \in TDB \mid T \sqsubseteq T'\}\|$. Given a minimum
support threshold min_sup, a tree $T$ is called a frequent tree pattern if
$sup(T) \ge$ min_sup.

Problem statement. Given a tree database TDB and a minimum support
threshold min_sup, the problem of frequent tree pattern mining is to find the
complete set of frequent tree patterns from database TDB.

^1 Conventionally, a tree $G'$ whose graph vertices and graph edges form
subsets of the graph vertices and graph edges of a given tree $G$ is called a
subtree of $G$.

In an ordered tree, by a preorder traversal of all nodes in the tree, a
preorder traversal label sequence (or $l$-sequence in short) can be made. For
example, the $l$-sequence of tree $T$ in Figure 1(a) is BACDBGEAED. Preorder
traversal sequences are not unique with respect to trees. That is, multiple
different trees may result in an identical preorder traversal sequence. To
overcome this problem, we can add the levels of nodes into the sequence and
make up the preorder traversal label-level sequence (or $l^2$-sequence in
short). For example, the $l^2$-sequence of tree $T$ is B0A1C1D2B3G3E4A2E3D1.
We have the following results.

Theorem 1 (Uniqueness of $l^2$-sequences). Given trees
$T_1(v_1, N_1, L_1, E_1)$ and $T_2(v_2, N_2, L_2, E_2)$, their $l^2$-sequences
are identical if and only if $T_1$ and $T_2$ are isomorphic, i.e., there exists
a one-to-one mapping $f : N_1 \to N_2$ such that (1) $f(v_1) = v_2$; (2) for
every node $u \in N_1$, $L_1(u) = L_2(f(u))$; and (3) for every edge
$(u_1, u_2) \in E_1$, $(f(u_1), f(u_2)) \in E_2$.

Lemma 1. Let $S$ be the $l^2$-sequence of tree $T(v_0, N, L, E)$.

1. The first node in $S$ is $v_0$, whose level number is 0;

2. For every pair of immediate neighbors $L(u)_i L(v)_j$ in $S$,
$j \le (i + 1)$; and

3. For nodes $u$ and $v$ such that $u$ is the parent of $v$, $L(u)_i$
($i \ge 0$) is the nearest left neighbor of $L(v)_j$ in $S$ such that
$i = (j - 1)$.
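The two sequence encodings, and the way Lemma 1 lets the parent of each node be recovered from an $l^2$-sequence, can be sketched as follows. The nested (label, children) tuple representation and the function names are assumptions; the example tree is reconstructed here from the $l^2$-sequence B0A1C1D2B3G3E4A2E3D1 quoted in the text, since Figure 1(a) itself is not reproduced.

```python
def l2_nodes(tree, level=0):
    """Preorder list of (label, level) pairs for a (label, children) tree."""
    label, children = tree
    nodes = [(label, level)]
    for child in children:
        nodes.extend(l2_nodes(child, level + 1))
    return nodes

def l_sequence(tree):
    return "".join(label for label, _ in l2_nodes(tree))

def l2_sequence(tree):
    return "".join(f"{label}{level}" for label, level in l2_nodes(tree))

def parents(nodes):
    """Parent index of every node, using Lemma 1(3): the parent is the
    nearest entry to the left whose level is one less."""
    result = []
    for k, (_, level) in enumerate(nodes):
        parent = None
        for q in range(k - 1, -1, -1):
            if nodes[q][1] == level - 1:
                parent = q
                break
        result.append(parent)
    return result

# Tree T, rebuilt from the l^2-sequence given in the text.
T = ("B", [("A", []),
           ("C", [("D", [("B", []), ("G", [("E", [])])]),
                  ("A", [("E", [])])]),
           ("D", [])])

# Two hypothetical trees that share an l-sequence but not an l^2-sequence.
chain = ("A", [("B", [("C", [])])])
fork = ("A", [("B", []), ("C", [])])

print(l_sequence(T), l2_sequence(T))
print(l_sequence(chain) == l_sequence(fork), l2_sequence(chain), l2_sequence(fork))
print(parents(l2_nodes(T)))
```

The chain and the fork both give the $l$-sequence ABC, while their $l^2$-sequences differ, which is the ambiguity that the level numbers remove.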

Although an $l$-sequence is not unique with respect to trees, it can serve as
an isomorph for multiple trees. In other words, multiple $l^2$-sequences and
thus their corresponding trees can be isomers of an $l$-sequence.

Consider a sequence $S = s_1 \cdots s_n$. A sequence
$S' = s'_1 \cdots s'_m$ is called a subsequence of $S$, and $S$ a super
sequence of $S'$, denoted as $S' \sqsubseteq S$, if there exist
$1 \le i_1 < \cdots < i_m \le n$ such that $s_{i_j} = s'_j$ for
$1 \le j \le m$. Given a bag of sequences SDB, the support of $S$ in SDB is
the number of $S$'s super sequences in SDB, i.e.,
$sup(S) = \|\{S' \in SDB \mid S \sqsubseteq S'\}\|$.
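The subsequence relation and the support of a sequence can be sketched in a few lines. The function names and the small sequence database are assumptions for illustration; only the first sequence (the $l$-sequence of tree $T$) comes from the text.

```python
def is_subsequence(sub, seq):
    """True if `sub` can be obtained from `seq` by deleting symbols."""
    it = iter(seq)
    # `symbol in it` advances the iterator, so order is respected.
    return all(symbol in it for symbol in sub)

def sup(sdb, s):
    """Number of sequences in the bag `sdb` that are super sequences of `s`."""
    return sum(1 for seq in sdb if is_subsequence(s, seq))

# Hypothetical l-sequence database; the first entry is T's l-sequence.
sdb = ["BACDBGEAED", "BCB", "ABC"]
support_bcb = sup(sdb, "BCB")
support_cbd = sup(sdb, "CBD")
print(support_bcb, support_cbd)
```

With a threshold of 2, BCB would be a sequential pattern in this toy database while CBD would not.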

Given a tree database TDB, the bag of the $l$-sequences of the trees in TDB
forms a sequence database SDB. We have the following interesting result.

Theorem 2. Given a tree database TDB, let SDB be the corresponding
$l$-sequence database. For any tree pattern $T$, let $l(T)$ be the
$l$-sequence of $T$. Then, $sup(T) \le sup(l(T))$; and $T$ is frequent in TDB
only if $l(T)$ is frequent in SDB.

Theorem 2 provides an interesting heuristic for mining frequent tree patterns:
we can first mine the sequential patterns in the $l$-sequence database, and
then mine tree patterns accordingly. In particular, a sequential pattern in the
$l$-sequence database (with respect to the same support threshold in both tree
database and $l$-sequence database) is called an $l$-pattern.

Given a tree $T$, not every subsequence of $T$'s $l$-sequence corresponds to an
embedded subtree of $T$. For example, consider the tree $T$ in Figure 1(a). The
$l$-sequence is BACDBGEAED. CBD is a subsequence. However, there exists
no embedded subtree $T'$ in $T$ such that its $l$-sequence is CBD.

Fortunately, whether a subsequence corresponds to an embedded subtree
can be determined easily from an $l^2$-sequence. Let $S$ be the $l^2$-sequence
of a tree $T$. For a node $v_i$ in $S$ (a node $v$ at level $i$), the scope of
$v_i$ is the longest subsequence $S'$ starting from $v_i$ such that the
label-level of each node in $S'$, except for $v_i$ itself, is greater than $i$.
For example, the $l^2$-sequence of tree $T$ in Figure 1(a) is
B0A1C1D2B3G3E4A2E3D1. The scope of B0 is B0A1C1D2B3G3E4A2E3D1
and the scope of D2 is B3G3E4. We have the following result.

Input: a tree database TDB and a support threshold min_sup;
Output: all frequent tree patterns with respect to min_sup;
Method:
(1) scan database TDB once, generate its $l$-sequence database SDB;
(2) mine $l$-patterns from SDB using algorithm $l$-PrefixSpan;
(3) scan TDB, generate candidates according to $l$-patterns and find the frequent tree patterns;

Fig. 3. Algorithm Chopper

Lemma 2. Given a tree $T$, an $l$-sequence $S = v_1 \cdots v_k$ corresponds to
an embedded subtree in $T$ if and only if there exists a node $v_1$ in the
$l^2$-sequence of $T$ such that $v_2 \cdots v_k$ is a subsequence in the scope
of $v_1$.
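The scope test of Lemma 2 can be sketched directly on an $l^2$-sequence. In this illustration the scope is returned without the node itself (as in the D2 example, and because Lemma 2 only needs the part after $v_1$); single-letter labels, the parsing helper and all function names are assumptions.

```python
import re

def parse_l2(s):
    """Split an l^2-sequence such as 'B0A1C1' into (label, level) pairs.
    Assumes single-letter labels followed by a decimal level."""
    return [(label, int(level)) for label, level in re.findall(r"([A-Za-z])(\d+)", s)]

def scope(nodes, k):
    """Entries following nodes[k] whose level is greater than its level,
    up to the first entry that is not deeper (the descendants of nodes[k])."""
    level = nodes[k][1]
    end = k + 1
    while end < len(nodes) and nodes[end][1] > level:
        end += 1
    return nodes[k + 1:end]

def is_subsequence(sub, seq):
    it = iter(seq)
    return all(symbol in it for symbol in sub)

def has_embedded_subtree(l2, lseq):
    """Lemma 2: some node labelled lseq[0] has lseq[1:] as a subsequence
    of the labels in its scope. Assumes `lseq` is non-empty."""
    nodes = parse_l2(l2)
    for k, (label, _) in enumerate(nodes):
        if label == lseq[0]:
            labels = [lab for lab, _ in scope(nodes, k)]
            if is_subsequence(lseq[1:], labels):
                return True
    return False

T_L2 = "B0A1C1D2B3G3E4A2E3D1"          # l^2-sequence of tree T from the text
nodes = parse_l2(T_L2)
scope_d2 = "".join(f"{lab}{lv}" for lab, lv in scope(nodes, 3))
print(scope_d2, has_embedded_subtree(T_L2, "CBD"), has_embedded_subtree(T_L2, "BCB"))
```

For tree $T$, the final D1 lies outside the scope of C1, which is why CBD is a subsequence of the $l$-sequence without corresponding to an embedded subtree.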



3 Algorithm Chopper 

Algorithm Chopper is shown in Figure 3. The correctness of the algorithm follows
from Theorem 2. In the first 2 steps, Chopper finds the sequential patterns (i.e.,
l-patterns) from the l-sequence database. Based on Theorem 2, we only need to
consider the trees whose l-sequence is an l-pattern. In the last step, Chopper
scans the database to generate candidate tree patterns and verify the frequent
tree patterns.

To make the implementation of Chopper as efficient as possible, several
techniques are developed. The first step of Chopper is straightforward. Since the
trees are stored as l²-sequences in the database, Chopper does not need to form
the explicit l-sequence database SDB. Instead, it uses the l²-sequence database
and just ignores the level numbers.

In the second step of Chopper, we need to mine sequential patterns (i.e.,
l-patterns) from the l-sequence database. A revision of algorithm PrefixSpan [9],
called l-PrefixSpan, is used. Some specific techniques have been developed to
enable efficient implementation of PrefixSpan. Interested readers should refer to
[9] for a detailed technical discussion.

While PrefixSpan can find the sequential patterns, the tree pattern mining
needs only part of the complete set as l-patterns. One key observation here is that
only those l-patterns having a potential frequent embedded tree pattern should be
generated. The idea is illustrated in the following example.

Example 1. Figure 2 shows a tree database as the running example in this paper.
Suppose that the minimum support threshold is 2, i.e., every embedded subtree
is a frequent tree pattern if it appears in at least two trees in the database.
The l²-sequences of the trees are also shown in Figure 2. If we ignore the level
numbers in the l²-sequences, we get the l-sequences. Clearly, sequence BCB
appears twice in the l-sequence database, and thus it is considered as a sequential
pattern by PrefixSpan. However, we cannot derive any embedded tree in the
database whose l-sequence is BCB. Thus, such patterns should not be generated.

We revise PrefixSpan to l-PrefixSpan to mine only the promising l-patterns
and prune the unpromising ones. In l-PrefixSpan, when counting sup(S) for
a possible sequential pattern S from the (projected) databases, we count only
those trees containing an embedded subtree whose l-sequence is S. Following
Lemma 2, we can easily determine whether an l-sequence corresponds to an
embedded subtree in a tree. For example, sup(BCB) = 0 and sup(ABC) = 1. Thus,
both BCB and ABC will be pruned in l-PrefixSpan.
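The difference between the two ways of counting support can be illustrated with a small sketch. It is not the projected-database machinery of PrefixSpan; it simply compares a plain subsequence count with a count that applies the scope test of Lemma 2. The three-tree database below is hypothetical (it is not the database of Figure 2), and all function names are assumptions made for this illustration.

```python
import re

def parse_l2(seq):
    """'A0B1C2' -> [('A', 0), ('B', 1), ('C', 2)]; single-letter labels assumed."""
    return [(lab, int(lev)) for lab, lev in re.findall(r"([A-Za-z])(\d+)", seq)]

def is_subsequence(pattern, labels):
    it = iter(labels)
    return all(ch in it for ch in pattern)

def has_embedded_match(pattern, nodes):
    """Scope test: some node carries pattern[0] and the rest of the pattern
    is a subsequence of the labels of that node's descendants."""
    for k, (label, level) in enumerate(nodes):
        if label != pattern[0]:
            continue
        end = k + 1
        while end < len(nodes) and nodes[end][1] > level:
            end += 1
        if is_subsequence(pattern[1:], [lab for lab, _ in nodes[k + 1:end]]):
            return True
    return False

def plain_support(pattern, tdb):
    """Support as an ordinary sequential pattern miner would count it."""
    return sum(is_subsequence(pattern, [lab for lab, _ in t]) for t in tdb)

def embedded_support(pattern, tdb):
    """Support counting only trees that pass the scope test."""
    return sum(has_embedded_match(pattern, t) for t in tdb)

# Hypothetical toy database, one l²-sequence per tree.
TDB = [parse_l2(s) for s in ("A0B1C1B1", "B0C1A1B2", "A0B1C2")]

for pattern in ("BC", "BCB"):
    print(pattern, plain_support(pattern, TDB), embedded_support(pattern, TDB))
```

With a threshold of 2 on this toy data, BCB passes the plain count but fails the scope-aware count, which is the kind of pattern the revised counting is meant to prune.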

It can be verified that many patterns returned by PrefixSpan, such as AB,
ABC, ABCD, etc., will be pruned in l-PrefixSpan. In our running example, only
9 l-patterns are returned, i.e., A, AC, AD, B, BC, BCD, BC, C and D.

In the last step of Chopper, a naive implementation would be as follows. We
can generate all possible tree patterns as candidates according to the l-patterns
found by l-PrefixSpan, and then scan the tree database once to verify them.
Unfortunately, such a naive method does not scale well for large databases:
there can be a huge number of candidate tree patterns!

Chopper adopts a more elegant approach here. It scans the tree database
against the l-patterns found by l-PrefixSpan. For each tree in the tree database,
Chopper first verifies whether the tree contains some candidate tree patterns.
If so, the counters of those tree patterns are incremented by one. Then, Chopper
also verifies whether more candidate tree patterns corresponding to some
l-patterns can be generated from the current tree. If so, they are generated
and their counters are set up with an initial value of 1. Moreover, to facilitate the
matching between a tree in the tree database and the l-patterns as well as
candidate tree patterns, all l-patterns and candidate tree patterns are indexed by
their prefixes. Please note that a tree is stored as an l²-sequence in the database.

The major cost of Chopper comes from two parts. On the one hand,
l-PrefixSpan mines the l-patterns. The pruning of unpromising l-patterns improves
the performance of the sequential pattern mining here. On the other hand,
Chopper has to check every tree against the l-patterns and the candidate tree
patterns in the last step. In this step, only one scan is needed.

Although l-PrefixSpan can prune many unpromising l-patterns, some
unpromising l-patterns may still survive the pruning, such as BCD in our
running example. The reason is that the l-pattern mining process and the tree
pattern verification process are separated in Chopper. Such unpromising l-patterns
may bring unnecessary overhead to the mining. Can we integrate the l-pattern
mining process into the tree pattern verification process so that the efficiency can
be improved further? This observation motivates our design of XSpanner.

Input and output: same as Chopper;

Method:

(1) scan database TDB once, find frequent length-1 l-patterns;

(2) for each length-1 l-pattern x_i, do

(3) output a frequent tree pattern x_i0;

(4) form the (x_i0)-projected database TDB_{x_i0};

(5) if TDB_{x_i0} has at least min_sup trees then mine the projected database;

Fig. 4. Algorithm XSpanner



4 Algorithm XSpanner 

Algorithm XSpanner is shown in Figure 4. Clearly, by scanning the tree database
TDB only once, we can find the frequent items in the database, i.e., the items
appearing in at least min_sup trees in the database. They are length-1 l-patterns.
This is done in Step (1) in algorithm XSpanner.
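Step (1) amounts to counting, for every label, the number of trees that contain it. A minimal sketch follows, assuming trees are stored as label-level strings with single-letter labels; the function name and the toy database are hypothetical.

```python
import re
from collections import Counter

def frequent_labels(tdb, min_sup):
    """One scan over the database: a label is frequent if it appears in at
    least min_sup trees. Each tree contributes at most once per label."""
    counts = Counter()
    for l2_seq in tdb:
        labels = set(re.findall(r"([A-Za-z])\d+", l2_seq))
        counts.update(labels)
    return sorted(lab for lab, c in counts.items() if c >= min_sup)

# Hypothetical toy database of label-level strings.
TDB = ["A0B1C1B1", "B0C1A1B2", "A0B1D2"]
print(frequent_labels(TDB, 2))
```

Collecting the labels of a tree into a set before counting is what keeps a label that occurs several times in one tree from being counted more than once.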

Suppose that x_1, ..., x_m are the m length-1 l-patterns found. We have the
following two claims. On the one hand, for each i (1 ≤ i ≤ m), x_i0 is a frequent
tree pattern in the database. On the other hand, the complete set of frequent tree
patterns can be divided into m exclusive subsets: the i-th subset contains the
frequent tree patterns having x_i as the root.

Example 2. Let us mine the frequent tree patterns in the tree database TDB in
Figure 2. By scanning TDB once, we can find 4 length-1 l-patterns, i.e., A, B,
C and D. Clearly, A0, B0, C0 and D0 are the four frequent tree patterns. On
the other hand, the complete set of tree patterns in the database can be divided
into 4 exclusive subsets: the ones having A, B, C and D as the root, respectively.

The remainder of XSpanner (lines (2)-(5)) mines the subsets one by one.

For each length-1 l-pattern x_i, XSpanner first outputs a frequent tree pattern
x_i0 (line (3)). Clearly, to find frequent tree patterns having x_i0 as a root, we
need only the trees containing x_i. Moreover, for each tree in the tree database
containing x_i, we need only the subtrees having x_i as a root. We collect such
subtrees as the (x_i0)-projected database. This is done in line (4) in the algorithm.

In implementation, constructing a physical projected database can be
expensive in both time and space. Instead, we apply the pseudo-projection
techniques [9,8]. The idea is that, instead of physically constructing a copy of the
subtrees, we reuse the trees in the original tree database. For each node, the tree
id and a hyperlink are attached. By linking the nodes labelled x_i together using
the hyperlinks, we can easily get the (x_i0)-projected database. Please note that
such hyperlinks can be reused later to construct other projected databases.

Example 3. Figure 5 shows the B0-projected database. Basically, all the subtrees
rooted at a node B are linked, except for the leaf nodes labelled B. The leaf node
labelled B in the left tree is not linked.

Fig. 5. The B0-projected database

Fig. 6. The C0-projected database

Why should the leaf nodes not be linked? A leaf node cannot contain any
embedded subtree larger than the tree pattern we have got so far. In other words,
not linking such leaf nodes will not affect the result of the mining. Thus, it is
safe to prune them in the projected database.

Please note that there can be more than one node with the same label in a
tree, such as the tree in the middle of Figure 5. These subtrees are processed as
follows: a node labelled with B is linked only if it is not a descendant of some
other node that is also labelled with B. As another example, Figure 6 shows the
C0-projected database. In the tree in the middle, only the root node is linked.
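The two linking rules (skip leaf nodes, and skip nodes nested under another node with the same label) can be sketched as follows. This illustrates pseudo-projection as a list of (tree id, node position) references rather than copied subtrees; it is not the authors' hyperlink structure. The function names and the toy database are hypothetical, and single-letter labels are assumed.

```python
import re

def parse_l2(seq):
    """'B0C1B2' -> [('B', 0), ('C', 1), ('B', 2)]; single-letter labels assumed."""
    return [(lab, int(lev)) for lab, lev in re.findall(r"([A-Za-z])(\d+)", seq)]

def linked_nodes(nodes, label):
    """Positions of the nodes labelled `label` that would be linked:
    not a leaf, and not a descendant of another node with the same label."""
    linked = []
    covered_until = 0  # positions below this lie inside an earlier `label` subtree
    for k, (lab, level) in enumerate(nodes):
        if lab != label or k < covered_until:
            continue
        end = k + 1
        while end < len(nodes) and nodes[end][1] > level:
            end += 1
        covered_until = end
        if end > k + 1:  # the node has at least one descendant, so it is not a leaf
            linked.append(k)
    return linked

def projected_database(tdb, label):
    """Pseudo-projection: (tree id, node position) pairs instead of subtree copies."""
    links = []
    for tid, seq in tdb.items():
        links.extend((tid, k) for k in linked_nodes(parse_l2(seq), label))
    return links

# Hypothetical toy database keyed by tree id.
TDB = {1: "A0B1C2B1", 2: "B0C1B2D3A1", 3: "C0B1D2B1A2"}
print(projected_database(TDB, "B"))
```

In this toy data, tree 3 contributes two linked subtrees, so any support counting over the projected database has to deduplicate by tree id.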

As shown in Figure 5, more than one subtree from a tree in the tree database
may be included in the projected database. When counting the support of a tree
pattern, such as BC, the pattern should gain a support of only 1 from the same
original tree, even if it is matched in multiple subtrees.

The projected databases can be mined recursively. In the (x_i0)-projected
database, the length-1 l-patterns should be found. For each length-1 l-pattern
x_j, x_i0x_j1 is a frequent tree pattern. The set of frequent tree patterns having x_i
as a root can be divided into smaller subsets according to the x_j's.

In general, suppose that P is a frequent tree pattern and the P-projected
database is formed. Let S be the l-pattern of P. By scanning the P-projected
database once, XSpanner finds the frequent items in the projected database. Then,
for each frequent item x_j, XSpanner checks whether Sx_j is an l-pattern in the
P-projected database and whether there exists a frequent tree pattern
corresponding to Sx_j. If so, then the new frequent tree pattern is output and the
recursive mining continues. Otherwise, the search in the branch is terminated.

The correctness of the algorithm can be proved. Limited by space, we omit 
the details here. 



5 Experiments and Performance Study 

In this section, we evaluate the performance of XSpanner and Chopper in
comparison with TreeMinerV [13]. All the experiments are performed on a
Pentium IV 1.7GHz PC with 512MB RAM. The OS platform is Linux Red
Hat 9.0 and the algorithms are implemented in C++.

Fig. 7. Results on synthetic data sets: (a) Sup vs. time, (b) Size vs. time, (c) Height vs. time, (d) Fanout vs. time



We wrote a synthetic data generation program to output all the test data.
There are 8 parameters for adjustment: the number of labels |S|, the probability
threshold p that determines whether a node in the tree generates children, the
number of basic pattern trees (BPT) |L|, the average height of the BPT |I|, the
maximum fanout (number of children) of nodes in the BPT |C|, the number of
synthetic trees |N|, the average height of synthetic trees |H|, and the maximum
fanout of nodes in synthetic trees |F|. The actual height of each synthetic (basic
pattern) tree is determined by a Gaussian distribution with a mean of |H| (|I|)
and a standard deviation of 1.

First, we consider the scalability of the three algorithms with minsup, while
the other parameters are: S = 100, p = 0.5, L = 10, I = 4, C = 3, N = 10000,
H = 8, F = 6. Figure 7(a) shows the result, where minsup is set from 0.1 to
0.004. In this figure, both the X and Y axes are on a log10 scale for the
convenience of observation.

From Figure 7(a), we can see that when the support threshold is larger than
5%, the three algorithms perform approximately the same. As the threshold
becomes smaller, XSpanner and Chopper begin to outperform TreeMinerV.
The sudden increase in running time when the threshold changes from 2% to 1%
deserves an explanation. We use 10000 trees as our test dataset; however, there
are only 100 node labels. Hence many more frequent structures are generated,
which leads to a longer running time. In this range, TreeMinerV runs out of
memory, while XSpanner and Chopper are still able to finish the computation.
This is because of the large number of candidates generated by Apriori-based
algorithms. It is also clear that when the threshold continues to decrease,
XSpanner surpasses Chopper in performance. XSpanner distinguishes between the
isomers during the generation of frequent sequences, which avoids generating a
large number of sequences that are frequent but whose substructures are
infrequent when the threshold is low.

Then, Figure 7(b) shows the scalability with data size. The data size N
varies from 10000 to 50000, while the other parameters are: S = 100, p = 0.5,
L = 10, I = 4, C = 3, H = 8, F = 6, minsup = 0.01. Here we find that the cost
in both time and space of XSpanner and Chopper is much smaller than that of
TreeMinerV, which is halted by memory overflow. The reason is that XSpanner
and Chopper save time and space by avoiding false candidate subtree
generation.

Fig. 8. Results on real data sets: (a) Sup vs. time, (b) Sup vs. # patterns, (c) Sup vs. avg # nodes

Finally, the scalability with tree size is shown in Figures 7(c) and (d). The
tree size parameters H and F vary, while the other parameters are: S = 100,
p = 0.5, L = 10, I = 4, C = 3, N = 10000, minsup = 0.01. In Figure 7(c), we only
vary H from 6 to 9. When H equals 6 or 7, the performance of
XSpanner and Chopper is better than that of TreeMinerV. Moreover, when the
trees become higher, the advantage grows. In particular, when H equals 8 or
9, the two algorithms clearly outperform TreeMinerV, because TreeMinerV
is halted by memory overflow. In Figure 7(d), the relative performance of the two
algorithms and TreeMinerV is similar to the case above: XSpanner and Chopper
perform better than TreeMinerV as the fanout continues to increase.

From all of the experiments, we conclude that integrating the process of
distinguishing isomers into the process of generating frequent sequences is an
acceptable and efficient approach.

We also tested XSpanner and Chopper in Web usage mining. We downloaded
the Web log of Hyperreal (http://music.hyperreal.org), chose the entries dating
from Sep. 10 to Oct. 9, 1998 as the input data, and then transformed the Web log
into a tree-like data set that contains more than 12000 records in total.

Figure 8(a) shows the performance of the three algorithms, where minsup
is set from 0.1 to 0.0003. In this figure, both the X and Y axes are on a log10
scale for the convenience of observation. We can see that the performance
of XSpanner and Chopper is better than that of TreeMinerV. In particular,
TreeMinerV is halted in 3 hours due to memory overflow when minsup = 0.0006,
while the two algorithms run well. From the figure, we notice that the performance
of XSpanner is more stable than that of Chopper. As the threshold decreases,
XSpanner gradually surpasses Chopper. It should also be noted that XSpanner
does not perform markedly better until minsup drops to 0.0003.

Finally, Figure 8(b) shows the number of frequent patterns generated by the
algorithm, while Figure 8(c) shows the average number of nodes of the frequent
patterns generated by the algorithm, where minsup is set from 0.1 to 0.0003.






6 Conclusions 

In this paper, we present two pattern-growth algorithms, Chopper and XSpanner,
for mining frequent tree patterns. In the future, we would like to explore
pattern-growth mining of other complex patterns, such as frequent graph patterns.



Abstract. Previous work on XML association rule mining focuses on
mining from the data existing in XML documents at a certain time point.
However, due to the dynamic nature of online information, an XML
document typically evolves over time. Knowledge obtained from mining the
evolution of an XML document would be useful in a wide range of
applications, such as XML indexing and XML clustering. In this paper, we
propose to mine a novel type of association rule from a sequence of
changes to XML structure, which we call the XML Structural Delta
Association Rule (XSD-AR). We formulate the problem of XSD-AR mining by
considering both the frequency and the degree of changes to XML
structure. An algorithm, which is derived from FP-growth, and its
optimizing strategy are developed for the problem. Preliminary experiment
results show that our algorithm is efficient and scalable at discovering a
complete set of XSD-ARs.



1 Introduction 

XML is rapidly emerging as the de facto standard for data representation and
exchange on the Web. With the ever-increasing amount of available XML data,
the data mining community has been motivated to discover valuable knowledge
from collections of XML documents. As one of the most important techniques of
data mining, association rule mining has also been introduced into XML
repositories [2][4]. Currently, two types of data in XML have been exploited to
mine rules: XML content and XML structure. The former aims to discover
associations between frequent data values [2], whereas the latter focuses on
discovering associations between frequent substructures [4]. Existing work on XML
association rule mining studies the data (content or structure) in XML documents at
a certain point in time. However, due to the dynamic nature of online
information, XML documents typically evolve over time. Changes to XML documents
can be divided similarly into changes to XML content (also called content deltas,
i.e., modifications of element values) and changes to XML structure (also called
structural deltas, i.e., additions or deletions of elements). From the sequence of
historical versions of the XML file, users might be interested in the associations
between content or structural deltas. In this paper, we focus on mining
association rules from XML structural deltas, which we refer to as XSD-AR (XML
Structural Delta Association Rule). Given a sequence of historical versions of
an XML document, the goal of XSD-AR mining is to discover rules stating that
when some substructures of the XML document change, some other substructures
change as well with a certain probability. The contributions of this paper are
summarized as follows: a) a new problem of association rule mining from
structural deltas of historical XML documents is formally defined; b) an algorithm,
derived from the FP-growth algorithm, and its optimizing strategy are developed to
handle this problem; c) experiments are conducted to evaluate the performance of
the algorithm and the effect of the optimizing strategy.

2 Problem Statement 

In this section, we first describe basic change operations which result in XML 
structural deltas. Then, we define several metrics to measure the degree of change 
and the frequency of change. Finally, the problem of XSD-AR is defined formally. 

An XML document can be represented as a tree according to the Document
Object Model (DOM) specification. In this paper, we focus on unordered
XML trees. In the context of XSD-AR mining, we consider the following two
basic change operations: Insert(X(name, value), Y), which creates a new node
X, with node name “name” and node value “value”, as a child node of node Y
in an XML tree structure, and Delete(X), which removes node X from an XML
tree structure.

Now we introduce the metrics which measure the degree of change for each 
single subtree and the frequency of change for a set of subtrees. Interesting 
substructure patterns and association rules can then be identified based on these 
metrics. 

Degree of Change. Let $\langle t_i, t_{i+1} \rangle$ be two historical versions of a subtree t
in an XML tree structure T. Let $|C|$ be the number of basic change operations
which change the structure of t from the ith version to the (i+1)th version.
Then the degree of change for subtree t from version i to version (i+1) is:

$$DoC(t, i, i+1) = \frac{|C|}{|t_i \cup t_{i+1}|}$$

where $t_i \cup t_{i+1}$ is the set of unique nodes of tree t in the ith version and the
(i+1)th version. □

Frequency of Change. Let $\langle T_1, T_2, \ldots, T_n \rangle$ be a sequence of historical
versions of an XML tree structure. Let $\langle \Delta_1, \Delta_2, \ldots, \Delta_{n-1} \rangle$ be the sequence of
deltas generated by comparing each pair of successive versions, where $\Delta_i$
($1 \le i \le n-1$) consists of the subtrees changed between the two versions. Let S be
a set of subtrees, $S = \{t_1, t_2, \ldots, t_m\}$, where $\forall j$ ($1 \le j \le m$), $\exists i$ ($1 \le i \le n-1$)
s.t. $t_j \in \Delta_i$. Let $DoC_{j_i}$ be the degree of change for subtree $t_j$ from the ith
version to the (i+1)th version. The FoC of the set S is:

$$FoC(S) = \frac{\sum_{i=1}^{n-1} V_i}{n-1}, \quad \text{where } V_i = \prod_{j=1}^{m} V_{j_i} \text{ and } V_{j_i} = \begin{cases} 1, & \text{if } DoC_{j_i} \neq 0 \\ 0, & \text{if } DoC_{j_i} = 0 \end{cases} \quad (1 \le j \le m)$$ □

Weight. Furthermore, in order to measure how significantly the subtrees in a set
usually change, we define the Weight of a set, which compares the DoC of the
subtrees against some user-defined interesting value $\alpha$:

$$Weight(S) = \frac{\sum_{i=1}^{n-1} D_i}{(n-1) \cdot FoC(S)}, \quad \text{where } D_i = \prod_{j=1}^{m} D_{j_i} \text{ and } D_{j_i} = \begin{cases} 1, & \text{if } DoC_{j_i} \ge \alpha \\ 0, & \text{otherwise} \end{cases} \quad (1 \le j \le m)$$ □



Frequent Subtree Pattern. We define a set of subtrees S = {t_1, t_2, ..., t_m}
as a frequent subtree pattern if the subtrees in the set frequently change together
and they usually change significantly when they change together. That is,

- the FoC of the set is no less than some user-defined minimum FoC β:
FoC(S) ≥ β;

- the Weight of the set is no less than some user-defined minimum Weight γ:
Weight(S) ≥ γ. □



Confidence. XSD-ARs can then be derived from frequent subtree patterns. The
metric Confidence reflects the conditional probability that significant changes of
a set of subtrees lead to significant changes of another set of subtrees:

$$Confidence(X \Rightarrow Y) = \frac{FoC(X \cup Y) \cdot Weight(X \cup Y)}{FoC(X) \cdot Weight(X)}$$



Problem Definition. The problem of XSD-AR mining can be formally stated
as follows: Let <T_1, T_2, ..., T_n> be a sequence of historical versions of an XML
tree structure. Let <Δ_1, Δ_2, ..., Δ_{n-1}> be the sequence of deltas. A structural
delta database SDDB can be formed, where each tuple <DID, SID, DoC>
comprises a delta identifier, a subtree identifier and a degree of change for the
subtree. Let S = {t_1, t_2, ..., t_m} be the set of changed subtrees s.t. each Δ_i ⊆ S. An
XSD-AR A ⇒ B is an implication between two subtree sets A and B where
A ∩ B = ∅. Given a FoC threshold β, a Weight threshold γ and a Confidence
threshold, the problem of XSD-AR mining is to find the complete set of
XSD-ARs A ⇒ B such that FoC(A ∪ B) ≥ β, Weight(A ∪ B) ≥ γ, and
Confidence(A ⇒ B) is no less than the Confidence threshold.
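The three metrics can be sketched directly over a list of deltas. The sketch rests on one reading of the definitions: FoC is the fraction of deltas in which every subtree of the set changes, and Weight is the fraction of those deltas in which every subtree changes with a DoC of at least α. The data layout (one dictionary per delta mapping subtree identifiers to DoC values), the function names and the numbers are all hypothetical, and α is assumed to be positive.

```python
def changes_together(deltas, subtrees):
    """Number of deltas in which every subtree in `subtrees` has a nonzero DoC."""
    return sum(all(d.get(t, 0.0) != 0 for t in subtrees) for d in deltas)

def foc(deltas, subtrees):
    """Frequency of change: share of the n-1 deltas where all subtrees change."""
    return changes_together(deltas, subtrees) / len(deltas)

def weight(deltas, subtrees, alpha):
    """Share of the co-change deltas where every DoC is at least alpha (alpha > 0)."""
    together = changes_together(deltas, subtrees)
    significant = sum(all(d.get(t, 0.0) >= alpha for t in subtrees) for d in deltas)
    return significant / together if together else 0.0

def confidence(deltas, x, y, alpha):
    """Confidence(X => Y) = FoC(X u Y) * Weight(X u Y) / (FoC(X) * Weight(X))."""
    xy = set(x) | set(y)
    denom = foc(deltas, x) * weight(deltas, x, alpha)
    return foc(deltas, xy) * weight(deltas, xy, alpha) / denom if denom else 0.0

# Hypothetical data: five deltas (so n - 1 = 5), DoC per changed subtree.
deltas = [
    {"t1": 0.5, "t2": 0.1},
    {"t1": 0.6, "t2": 0.4},
    {"t1": 0.2},
    {"t2": 0.7},
    {"t1": 0.8, "t2": 0.9},
]
alpha = 0.3
print(foc(deltas, {"t1", "t2"}), weight(deltas, {"t1", "t2"}, alpha))
print(confidence(deltas, {"t1"}, {"t2"}, alpha))
```

Under this reading, FoC(S) * Weight(S) is simply the fraction of deltas in which all subtrees of S change significantly, so Confidence behaves like the usual ratio of two supports.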

3 Algorithm: Weighted-FPgrowth 

In this section, we present the procedure of mining XSD-ARs. Given a sequence
of historical versions of an XML document, three phases are involved in mining
the set of XSD-ARs: a) SDDB construction; b) frequent subtree pattern
discovery; and c) XSD-AR derivation. Since phases a) and c) can be handled in a
straightforward way based on known conditions, the subsequent discussion
focuses on phase b), the discovery of the set of frequent subtree patterns.

Fig. 1. Weighted-FPtree and Conditional Weighted-FPtree

Data Structure. Each subtree in a delta has one of two states: its DoC
is either less than the user-defined minimum DoC α or no less than α. We use
a pair of identifiers to represent the two states of a subtree. Given a subtree
t_i, if its DoC is less than α, we use t_i' to represent it in this delta; otherwise,
we use the original identifier t_i. For instance, suppose the thresholds of FoC and
Weight are 40% and 50% respectively. The Weighted-FPtree constructed from the
transformed SDDB in Figure 1 (a) is presented in Figure 1 (b).
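The transformation of the SDDB into per-delta item lists with two identifiers per subtree can be sketched as follows. The tuples, the threshold and the identifier `t5` are hypothetical; in particular, marking the insignificant state with a trailing prime is an assumption made for this illustration, and the number of deltas is passed explicitly because a delta without changes leaves no tuple behind.

```python
from collections import defaultdict

def transform_sddb(sddb, alpha):
    """Group <DID, SID, DoC> tuples by delta. A subtree whose DoC is below
    alpha is recorded under a primed identifier, otherwise under its own."""
    deltas = defaultdict(list)
    for did, sid, doc in sddb:
        if doc == 0:
            continue  # an unchanged subtree does not belong to the delta
        deltas[did].append(sid if doc >= alpha else sid + "'")
    return dict(deltas)

def single_subtree_stats(transactions, sid, num_deltas):
    """FoC and Weight of one subtree, read off the two identifier counts."""
    primed = sid + "'"
    total = sum(sid in items or primed in items for items in transactions.values())
    significant = sum(sid in items for items in transactions.values())
    return total / num_deltas, (significant / total if total else 0.0)

# Hypothetical SDDB with five deltas and a minimum DoC of 0.3.
SDDB = [
    (1, "t1", 0.5), (1, "t5", 0.1),
    (2, "t1", 0.4), (2, "t5", 0.6),
    (3, "t2", 0.7),
    (4, "t5", 0.9),
    (5, "t2", 0.2),
]
transactions = transform_sddb(SDDB, alpha=0.3)
print(transactions)
print(single_subtree_stats(transactions, "t5", num_deltas=5))
```

Here the hypothetical subtree t5 occurs in three of five deltas and changes significantly in two of them, giving a FoC of 60% and a Weight of about 66%.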

Mining Procedure. We now explain how to mine frequent subtree patterns
from the Weighted-FPtree. The critical differences between Weighted-FPgrowth and
the original FP-growth are the way we calculate the FoC and the Weight for
patterns and the way we construct the conditional Weighted-FPtree. In the following,
we illustrate the algorithm and the differences with an example. Consider the
last subtree in header table, tg. FoC(tg) is 60% since the total number of occur- 
rence of tg and t!^ in Weighted-FPtree is three. Weight(tg) is 66% since tg occurs 
twice. Hence, tg is frequent. There are three paths related to tg: <if^:l,tQ:l,ifg:l >, 
<if^:l,t2:l,tg:l>, and <tg,:l,tg:l,tg:l> (the number after colon indicates the 
number of deltas in which the subtrees change together with tg). Since the min- 
imum FoC is 40% and the minimum weight is 50%, subtrees tg and tg may 
probably be frequent patterns with subtree tg. Hence, the three prefix paths: 
{fg:l,tg:l }, {tfgC }, and {tgCjtg:! }, forms tg ’s conditional pattern base. We need 
to construct a conditional Weighted-FPtree for subtree tg from the three prefix 
paths. Note that in the first prefix path, subtree tg occurs as fg, which means 
subtree tg does not change significantly with subtree tg and tg in this path. To 
record this fact, we need to replace tg with tfg. In fact, when constructing condi- 
tional Weighted-FPtree for any node with identifier tj in paths where subtree 
ti occurs as should be replaced with . Then the conditional Weighted-FPtree 
of tg is shown in Figure 1 (c). Then frequent patterns related to tg can be mined 
in the same way. We also consider optimizing the space cost of the Weighted-FPtree
by registering information in edges. Specifically, when a subtree changes
significantly in a delta, the edge in the Weighted-FPtree connecting it to its child is
labelled as positive. Otherwise, the edge is labelled as negative. The number
of nodes is then reduced, because any node in the optimized Weighted-FPtree will
never have more than one child node representing the same subtree, with which
it changes significantly (insignificantly).

Fig. 2. Experiment Results ((c) Variation of number of deltas; Enhancement of Optimized WFPgrowth)



4 Experiment Results 

In this section, we study the performance of the proposed algorithms. The
algorithms are implemented in Java. Experiments are performed on a Pentium
IV 2.8GHz PC with 512 MB memory. The operating system is Windows 2000
Professional. We implemented a synthetic SDDB generator by extending the one
used in [1]. By default, the number of deltas is 10K, the average size of a
delta is 20 and the number of changed subtrees is 1000.

Methodology and Results. We carried out two experiments to evaluate the
efficiency of Weighted-FPgrowth by varying the thresholds on FoC and Weight
respectively. As shown in Figure 2 (a), Weighted-FPgrowth and its optimized
version are more efficient when the threshold is greater. As shown in
Figure 2 (b), the efficiency of both versions is similar and insensitive to the
variation of the minimum Weight. The scale-up features of the algorithms are tested
against the number of deltas, which is varied from 40K to 200K. As shown in
Figure 2 (c), the larger the number of deltas, the more efficiency the optimized
version gains by reducing the number of nodes in the tree structure.
Additionally, we calculate the compression ratio of the size of the optimized
Weighted-FPtree to the size of the Weighted-FPtree against the mean value of the
number of subtrees that change significantly in a delta (0.4-0.8). As shown in
Figure 2 (d), the compression ratio can be as high as 68%.



5 Conclusions and Future Work 

This paper proposed a novel problem of association rule mining based on changes
to XML structures: XSD-AR. An algorithm, Weighted-FPgrowth, and its
optimizing strategy are designed to handle it. Experiment results demonstrated the
efficiency and scalability of the algorithm. As ongoing work, we would like to
collect real-life data sets to verify the semantic meaning of the discovered XSD-ARs.



Abstract. Data mining has attracted a lot of research effort during
the past decade. However, little work has been reported on supporting
a large number of users with similar data mining tasks. In this paper,
we present a data mining proxy approach that provides basic services
so that the overall performance of the system can be maximized for
frequent itemset mining. Our data mining proxy is designed to respond quickly
to a user's request by constructing the required tree using both trees in the
proxy that were previously built for other users' requests and trees stored
on disk that are pre-computed from transactional databases. We define
a set of basic operations on pattern trees including tree projection and
tree merge. Our performance study indicated that the data mining proxy
significantly reduces the I/O cost and CPU cost to construct trees. The
frequent pattern mining costs with the trees constructed can be
minimized.



1 Introduction 

Data mining is a powerful technology being widely adopted to help decision
makers focus on the most important nontrivial/predictive information/patterns
that can be extracted from the large amounts of data they continuously accumulate
in their daily business operations. Finding frequent itemsets in a transactional
database is a common task for many applications. Recent studies show that
the pattern-growth method is one of the most efficient methods for frequent pattern
mining [1,2,3,4,5,6,7,8,9,10].

In this paper, we study a data mining proxy in a multi-user environment
where a large number of users issue similar but different data mining queries
from time to time. This work is motivated by the fact that in a large business
enterprise a large number of users need to mine patterns according to their needs.
Their needs may change from time to time, and a lot of similar queries are
registered in a short time. It is undesirable to process these data mining queries
one by one. In this paper, we focus on how to respond quickly to a user's data
mining query by constructing a smallest but sufficient tree using both trees in a
data mining proxy that were previously built for other users' requests and trees
stored on disk that are pre-computed for transactional databases. This has a
significant impact on data mining. First, the I/O cost can be minimized, because
trees do not always need to be constructed by loading data from disk. Second, the
data mining cost can be reduced, because the mining cost largely depends on the
tree size.

The rest of the paper is organized as follows. Section 2 introduces frequent 
mining queries and tree building requests. Section 3 gives an overview of the 
data mining proxy, followed by the definition of tree operations. We will report 
our experimental results in Section 4, and conclude the paper in Section 5. 

2 Preliminaries 

A data mining query is to mine frequent patterns from a transaction database 
TDB. Let r be a given min-support threshold and V be an itemset. We consider 
three types of data mining queries: Frequent Itemset Mining Query, Frequent 
super-itemset Mining Query and Frequent sub-itemset Mining Query^. All data 
mining queries can be one of these three types or a combination of them. Each 
query is processed in two steps: i) constructing an initial PP-tree, and ii) mining 
on top of the PP-tree^ being constructed. We call the former a frequent tree 
building request, which can be done by constructing a tree from either TDB or 
a previously built tree. The three corresponding types of frequent tree building 
requests are given as follows. 

— Frequent Itemset Tree Building Request: constructing a tree in memory
which is smallest and sufficient to mine frequent patterns whose support
is greater than or equal to $\tau$.

— Frequent Super-itemset Tree Building Request: constructing a tree
in memory which is smallest and sufficient to mine frequent patterns that
include the items in $V$ and have a support that is greater than or equal to $\tau$.

— Frequent Sub-itemset Tree Building Request: constructing a tree in
memory which is smallest and sufficient to mine frequent patterns that are
included in $V$ and have a support that is greater than or equal to $\tau$.
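The three query types can be illustrated with a brute-force sketch over a toy transaction database. This is only an illustration of what each query returns, not the PP-tree method of [10]; the function names and the sample data are assumptions introduced here.

```python
from itertools import combinations

def support_counts(transactions):
    """Count, for every itemset occurring in some transaction, how many
    transactions contain it (brute force; only suitable for toy data)."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts

def frequent_itemsets(transactions, tau):
    """Frequent Itemset Mining Query: support >= tau."""
    return {s: c for s, c in support_counts(transactions).items() if c >= tau}

def frequent_super_itemsets(transactions, tau, V):
    """Frequent Super-itemset Mining Query: frequent itemsets that include V."""
    V = frozenset(V)
    return {s: c for s, c in frequent_itemsets(transactions, tau).items() if V <= s}

def frequent_sub_itemsets(transactions, tau, V):
    """Frequent Sub-itemset Mining Query: frequent itemsets included in V."""
    V = frozenset(V)
    return {s: c for s, c in frequent_itemsets(transactions, tau).items() if s <= V}

# Hypothetical transaction database used only for this example.
TDB = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
```

With `tau = 2`, the super-itemset query for `V = {"c"}` keeps only the frequent itemsets containing `c`, while the sub-itemset query for `V = {"a", "c"}` keeps only those built from `a` and `c`.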

3 Data Mining Proxy 

The data mining proxy is designed and developed to support a large number of
users with similar data mining queries by responding quickly to a user's tree building
request. The efficiency of tree building requests is achieved by utilizing both
trees in the data mining proxy that were previously built for other users' requests
and trees stored on disk that are pre-computed for transactional databases. In
addition to efficiency, the effectiveness of the data mining proxy is achieved
by responding with a smallest and sufficient tree for a user's data mining query. This matters
because the cost of mining all possible patterns in a tree heavily depends on the
tree size: the smaller the tree, the lower the mining cost it incurs.

^1 The definitions of the three types of data mining queries are given in [10].

^2 PP-tree is introduced in [10].

Let TDB be a transaction database that includes items in
$I = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ is a 1-itemset.

Definition 1. Let $\tau$ be a given minimum support and $V \subseteq I$ a set of items. Let $T$,
$T_1$ and $T_2$ be PP-trees. Three tree operations are defined.

— A sub-projection operation, denoted $\pi_{\tau,V}(T)$, projects a subtree from the
given tree $T$. The resulting tree includes all itemsets in $V$ whose
support is greater than or equal to $\tau$.

— A super-projection operation, denoted irry{T), is to project a subtree from 
the given tree $T$. The resulting tree includes all itemsets that are supersets
of $V$ and have a support that is greater than or equal to $\tau$.

— A merge operation, denoted $T_1 \oplus T_2$, merges two trees, $T_1$ and $T_2$, and
results in a new tree.
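The effect of the three operations can be sketched on a simplified stand-in for a PP-tree. Here a "tree" is modeled only as a mapping from itemsets to supports, which is an assumption made for illustration (the actual PP-tree structure is defined in [10]); the function names and the sample supports are likewise hypothetical.

```python
def sub_projection(tree, tau, V):
    """Keep itemsets that are contained in V and have support >= tau."""
    V = frozenset(V)
    return {s: c for s, c in tree.items() if s <= V and c >= tau}

def super_projection(tree, tau, V):
    """Keep itemsets that contain V and have support >= tau."""
    V = frozenset(V)
    return {s: c for s, c in tree.items() if V <= s and c >= tau}

def merge(t1, t2):
    """Union of two trees. Assumes both were derived from the same database,
    so an itemset present in both carries the same support."""
    merged = dict(t1)
    merged.update(t2)
    return merged

def is_subtree(t1, t2):
    """True if every itemset represented in t1 is also represented in t2."""
    return all(s in t2 for s in t1)

fs = frozenset
# Hypothetical supports standing in for a tree materialized on disk.
T1 = {fs("c"): 5, fs("e"): 4, fs("ce"): 4, fs("d"): 4,
      fs("g"): 3, fs("dg"): 3, fs("cd"): 2}

T2 = sub_projection(T1, 4, {"c", "e"})
T3 = merge(T2, sub_projection(T1, 4, {"d", "g"}))
T4 = super_projection(T2, 4, {"e"})
```

In this abstraction a projection never adds itemsets and a merge never drops them, so each projected tree is a subtree of its source and each merged tree contains both of its operands.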

Given two PP-trees, $T_i$ and $T_j$, we say $T_i \subseteq T_j$ if $T_i$ is a subtree of $T_j$. More
precisely, by $T_i \subseteq T_j$ we mean that every itemset, $X$, represented in $T_i$ is also
in $T_j$. As examples, if $T_i = \pi_{\tau,V}(T_j)$ then $T_i \subseteq T_j$, and if $T_k = T_i \oplus T_j$, then
$T_i \subseteq T_k$ and $T_j \subseteq T_k$. Let $T_I$ be the largest PP-tree, which includes every single
$x_i \in I$ appearing in TDB. Obviously, any tree is a subtree of $T_I$. Consider a data
mining query, $q$, and a mining algorithm, $M$. We denote the resulting frequent
patterns for $q$ mined from a tree $T$ as $M_q(T)$. For two PP-trees, $T_i$ and $T_j$, we say
$T_i =_q T_j$ if $M_q(T_i)$ is the same as $M_q(T_j)$. In other words, the two trees, $T_i$ and $T_j$, will give the
same frequent patterns for the same query $q$. The smallest tree for a data mining
query is defined below.

Definition 2. For a data mining query $q$ with a minimum support $\tau$ and a set
of items $V$, the smallest tree, $T$ ($\subseteq T_I$), is a tree that satisfies the condition
$T =_q T_I$, but no longer satisfies $T =_q T_I$ if any item is removed from $T$.

A simple scenario that uses the tree operations to illustrate the proxy functions is
shown in Figure 1. Suppose we have constructed $T_1$ on disk, which is a PP-tree
of a transaction database. $T_2$, $T_3$ and $T_4$ are PP-trees constructed one by one in
memory for three data mining queries. $T_2$ is the result of the sub-projection of $\{c, e\}$
on $T_1$. The merge operation on $T_2$ and $\pi_{4,\{d,g\}}(T_1)$ results in $T_3$. $T_4$ is the result
of a super-projection on $T_2$.

Remark 1. It is important to note that, for a given data mining query $q$, a
sequence of tree operations that results in a smallest PP-tree for $q$ can be easily
identified. All the resulting trees shown in Figure 1 are the smallest trees for
responding to the corresponding data mining queries.



4 Performance Studies 

Some experimental results are shown in Figures 2 and 3. We conducted our experiments
on three datasets: T25.I20.D100K with 100K items, 10K items and
1K items, respectively. In the figures, the Y coordinates show the time cost of a
test consisting of 1000 randomly generated tree building requests.

[Figure 1 panels: (b) $T_2 = \pi_{4,\{c,e\}}(T_1)$; (c) $T_3 = \oplus(T_2, \pi_{4,\{d,g\}}(T_1))$; (d) $T_4$, a super-projection with $\tau = 4$ and $V = \{b\}$]

Fig. 1. Four Tree Projections (Suppose $T_1$ is the PP-tree materialized on disk)

[Figure 2 panels, x-axis "Min-Support Range": (a) 100K items (Sparse) ($\tau_m = 0.21$, SMB cache); (b) 10K items (Medium) ($\tau_m = 1.1$, 6.5MB cache); (c) 1K items (Dense) ($\tau_m = 6.8$, 18MB cache)]

Fig. 2. Various Query Patterns (1,000 queries: 60% are min-support queries, 20%
sub-itemset queries (5-10 items), and 20% super-itemset queries (1-5 items))

[Figure 3 panels, x-axis windows 0-50, 50-100, 100-150, 150-200, 200-250, 250-300: (a) 100K items (Sparse) ($\tau_m = 0.21$, 5-15 subitems); (b) 10K items (Medium) ($\tau_m = 1.1$, 5-15 subitems); (c) 1K items (Dense) ($\tau_m = 6.8$, 5-10 subitems)]

Fig. 3. Sub-itemset Queries (non-overlapping sliding window of $[R_{min}, R_{max}]$, 1,000
queries, all sub-itemset queries)

In Figure 2, P and NP denote runs with and without the data mining proxy, respectively. For
each test we selected a range of minimum supports, $R = [R_{min}, R_{max}]$, and
controlled the percentage of the minimum supports falling in that range. Three cases
are considered, in which 100%, 80% and 50% of the min-supports fall in $R$,
denoted P/NP-100, P/NP-80 and P/NP-50. In all cases, the runs using
the proxy outperform those without the proxy under the same setting.

In Figure 3 we focused on sub-itemset building requests. Sliding windows
over the item range were used to test the proxy. We varied the proxy's size, which
is indicated by the bar labels in the figure. As shown in the results, if the
memory size was too small, the proxy did not work well, while if the cache size
was appropriate, the performance of the proxy was very good.



5 Conclusion 

In this paper, we proposed a data mining proxy to support a large number of
users' mining queries, and focused on how to build the smallest but sufficient
trees in memory efficiently for mining. Three tree operations were proposed:
sub-projection, super-projection and merge. The proxy maximizes the usage of trees
in memory, and minimizes the I/O costs. Our experiments showed that the data
mining proxy is effective because in-memory tree operations can be processed
much faster than loading subtrees from disk.

Abstract. Ontologies play a key role in creating machine-processable Web
content to enable the Semantic Web. Extracting domain knowledge from data- 
base schemata can profitably support ontology development and then semantic 
markup of the instance data with the ontologies. The Entity-Relationship (ER) 
model is an industrial standard for conceptually modeling databases. This paper 
presents a formal approach and an automated tool for translating ER schemata 
into Web ontologies in the OWL Web Ontology Language. The tool can firstly 
read in an XML-coded ER schema produced with ER CASE tools such as Pow- 
erDesigner. Following the predefined knowledge-preserving mapping rules 
from ER schema to OWL DL (a sublanguage of OWL) ontology, it then auto- 
matically translates the schema into the ontology in both the abstract syntax and 
the RDF/XML syntax for OWL DL. Case studies show that the approach is feasible
and the tool is efficient, even for large-scale ER schemata.



1 Introduction and Motivation 

Unlike the current Web, where content is primarily intended for human consumption, 
the Semantic Web[3] will represent content that is also machine processable in order 
to provide better machine assistance for human users in tasks and enable a more open 
market for information processing and computer services. Ontologies play a key role 
in creating such machine-processable content by defining shared and common domain 
theories and providing a controlled vocabulary of terms, each with an explicitly 
defined and machine-processable semantics[3]. The importance of ontologies to the 
Semantic Web has prompted the establishment of the normative Web ontology and 
semantic markup language OWL[5]. In addition, many kinds of ontology tools have been
built to help people create ontologies and machine-processable Web content. These
include, among others, ontology development tools[6] such as Protege-2000 and OntoEdit
that can be used for building or reusing ontologies, and ontology-based
annotation tools[6] such as OntoMat-Annotizer and SMORE that allow users
to insert and maintain ontology-based markup in Web pages (semi)automatically.

Although many tools exist, acquiring domain knowledge to build ontologies requires
great effort. For this reason, it is necessary to develop methods and
techniques that reduce the effort needed for the knowledge acquisition
process; this is the goal of ontology learning. Maedche and Staab[12] distinguish
different ontology learning approaches that focus on the type of input used for 
learning, i.e., ontology learning from text, from dictionary, from knowledge base, 
from semi-structured schemata and from database schemata. Database schemata, in 
particular the conceptual schemata modeled in semantic data models such as the 
Entity-Relationship (ER) model contain (implicitly) abundant domain knowledge. 
Extracting the knowledge from them can thus profitably support the development of 
Web ontologies. This is one aspect which motivates our research. 

Another motivation of our work concerns the semantic markup of Web content. All
existing ontology-based annotation tools can only annotate static HTML pages so far^1.
Nowadays, however, Web-accessible databases are the main content sources on the current
Web, and the majority of Web pages are dynamic ones generated from databases[9].
How to “upgrade” this dynamic Web content to Semantic Web content
remains an open problem. This is therefore one of the aims of the ongoing IST
Project Esperonto Services^2. The project has envisioned a general “semantic wrapper”
method which leaves the content in the database and annotates the query that retrieves
the concerned content. Other alternatives exist. For instance, Stojanovic et al[14] have
presented an “instance migration” method which extracts the instance data from a
database and stores the content as an RDF file on the Semantic Web. The method has
an obvious shortcoming because it assumes that the database is static and no data update
occurs. Handschuh et al[9] have thus introduced a “deep annotation” method.
The method keeps the instance data remaining in the database. Client queries are exe- 
cuted based on the mapping rules between the client ontology and the database with 
the assumption that the Web site (database owner) is willing to provide the informa- 
tion structure of the underlying database and produce in advance server-side Web 
page markups according to the information structure. No matter what methods are 
adopted to “upgrade” a database to Semantic Web content, we argue that a common 
issue and a precondition here are that we have an ontology at hand that can conceptu- 
ally capture the knowledge of the domain of discourse of the database. Usually, data- 
base construction begins with ER modeling supported by CASE tools. Therefore, it is 
necessary to develop tools for translating ER schemata into OWL ontologies. 

In this paper, we present an automated tool, ER2WO, which is based on a formal
approach and performs the automatic translation from ER schemata to OWL ontologies.
The remainder of the paper is organized as follows. First, we introduce the
formal approach in Section 2, with the focus on how to define the knowledge-preserving
mapping from an ER schema to an OWL ontology. Next, in Section 3, we
describe the tool implementation and case studies, which serve as a proof of concept
for our approach. Section 4 discusses related work, and the last section concludes.



^1 See Semantic Web - Annotation & Authoring homepage http://annotation.semanticweb.org.
^2 http://www.esperpnto.net/.

2 Formal Approach

To perform the translation from ER schemata to OWL ontologies, we must first establish
the correspondence and then define the mapping between the two knowledge
representation languages, i.e., the ER model and the OWL language.

Early research, such as [4], on representing and reasoning about database conceptual
schemata introduced several formalisms for representing an ER
schema. The formalized ER schema can then be translated into a knowledge base in
Description Logics (DLs)[1] (such as ALUNI[4]), and it has been proved that the
translation preserves the semantics.

The OWL language is developed as a vocabulary extension of RDF and provides
three increasingly expressive sublanguages, Lite, DL and Full, designed for use by
specific communities of implementers and users[5]. OWL DL can be approximately
viewed as the expressive DL SHOIN(D), which is more expressive than ALUNI, with
an OWL DL ontology being equivalent to a SHOIN(D) knowledge base, and the set of
axioms of the ontology being equivalent to the set of assertions of the knowledge
base[11]. Therefore, we believe that there exists a formal and semantics-preserving
approach for translating an ER schema into an OWL DL ontology.



2.1 Formalization of ER Schemata 

We adopt the first-order formalization of the ER model introduced by Calvanese et al
in [4]. This formalization includes the most important features present in the different
variants of the ER model supported by existing CASE tools and has a well-defined
semantics, which makes it possible to establish a precise correspondence with the
Web ontology language OWL DL. In Definitions 1 and 2 below, we give the
formal syntax of an ER schema that was presented in [4], with minor changes to
the notation and the addition of key attributes^3 to the model.

Definition 1. For two finite sets $X$ and $Y$, we call a function from a subset of $X$ to $Y$ an
$X$-labeled tuple over $Y$. The labeled tuple $T$ which maps $x_i \in X$ to $y_i \in Y$, for $i = 1, \ldots, k$,
is denoted $[x_1 : y_1, \ldots, x_k : y_k]$.

Definition 2. An ER schema is a tuple $S = (\mathcal{L}_S, isa_S, att_S, rel_S, card_S)$, where

• $\mathcal{L}_S$ is a finite alphabet partitioned into a set $\mathcal{E}_S$ of entity symbols, a set $\mathcal{A}_S$ of attribute
symbols, a set $\mathcal{U}_S$ of ER-role symbols, a set $\mathcal{R}_S$ of relationship symbols, and a set $\mathcal{D}_S$
of domain symbols; each domain symbol $D \in \mathcal{D}_S$ has an associated predefined basic
domain $B^D$, and the various basic domains are assumed to be pairwise disjoint.

• $isa_S \subseteq \mathcal{E}_S \times \mathcal{E}_S$ is a binary relation which models the IS-A relationship between entities.

• $att_S$ is a function that maps each entity symbol in $\mathcal{E}_S$ to an $\mathcal{A}_S$-labeled tuple over $\mathcal{D}_S$.
For an entity $E \in \mathcal{E}_S$ such that $att_S(E) = [\ldots, A : D, \ldots]$, if the attribute $A$
can have only one (unique) value in $D$'s basic domain for each instance
of $E$, then $A$ is called a key attribute of entity $E$. The function is
used to model attributes of entities. For simplicity, we assume here that all attributes
are single-valued and mandatory, but the situation
beyond this assumption can also be handled easily.

• $rel_S$ is a function that maps each relationship symbol in $\mathcal{R}_S$ to a $\mathcal{U}_S$-labeled tuple
over $\mathcal{E}_S$. We assume without loss of generality that:

1. Each ER-role is specific to exactly one relationship.

2. For each ER-role $U \in \mathcal{U}_S$, there is a relationship $R \in \mathcal{R}_S$ and an entity $E \in \mathcal{E}_S$ such
that $rel_S(R) = [\ldots, U : E, \ldots]$.

The function actually associates a set of ER-roles to each relationship, implicitly
determining also the arity of the relationship.

• $card_S$ is a function from $\mathcal{E}_S \times \mathcal{R}_S \times \mathcal{U}_S$ to $\mathbb{N}_0 \times (\mathbb{N}_1 \cup \{\infty\})$ (where $\mathbb{N}_0$ denotes
nonnegative integers, $\mathbb{N}_1$ positive integers) that satisfies the following condition:
for a relationship $R \in \mathcal{R}_S$ such that $rel_S(R) = [U_1 : E_1, \ldots, U_k : E_k]$, $card_S(E, R, U)$ is
defined only if $U = U_i$ for some $i \in \{1, \ldots, k\}$, and if $E \; isa_S^* \; E_i$ (where $isa_S^*$ denotes the
reflexive transitive closure of $isa_S$). The first component of $card_S(E, R, U)$ is denoted
$min\_card_S(E, R, U)$ and the second component $max\_card_S(E, R, U)$. If not
stated otherwise, $min\_card_S(E, R, U)$ is assumed to be 0 and $max\_card_S(E, R, U)$ is assumed
to be $\infty$. The function $card_S$ is used to specify cardinality constraints, i.e.,
constraints on the minimum and maximum number of times an instance of an entity
may participate in a relationship via some ER-role.

^3 Here we only consider single-attribute keys and do not consider the individual attributes
composed to form a composite key of an entity.
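As an illustration of Definition 2, the components of an ER schema can be held in a small data structure. The following is a minimal sketch under stated assumptions: the class name `ERSchema`, its field layout, and the sample entities, roles and cardinalities are hypothetical and are not taken from the ER2WO implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ERSchema:
    isa: set = field(default_factory=set)    # pairs (sub_entity, super_entity)
    att: dict = field(default_factory=dict)  # entity -> {attribute: domain}
    rel: dict = field(default_factory=dict)  # relationship -> {role: entity}
    card: dict = field(default_factory=dict) # (entity, relationship, role) -> (min, max)

    def isa_star(self, e1, e2):
        """Reflexive transitive closure of isa: is e1 equal to, or a descendant of, e2?"""
        if e1 == e2:
            return True
        seen, stack = set(), [e1]
        while stack:
            current = stack.pop()
            for sub, sup in self.isa:
                if sub == current and sup not in seen:
                    if sup == e2:
                        return True
                    seen.add(sup)
                    stack.append(sup)
        return False

    def cardinality(self, entity, relationship, role):
        """Return (min_card, max_card); defaults to (0, infinity) if not stated."""
        roles = self.rel.get(relationship, {})
        if role not in roles:
            raise KeyError(f"{role} is not a role of {relationship}")
        if not self.isa_star(entity, roles[role]):
            raise ValueError(f"{entity} cannot participate via role {role}")
        return self.card.get((entity, relationship, role), (0, math.inf))

# Hypothetical fragment of a university-department schema.
schema = ERSchema(
    isa={("Faculty", "Employee")},
    att={"Employee": {"empID": "integer", "empName": "string"}},
    rel={"Work": {"workFor": "Employee", "employerOf": "Department"}},
    card={("Employee", "Work", "workFor"): (1, 1)},
)
```

The lookup mirrors the condition in the definition: a cardinality is defined only for a role of the given relationship and for an entity that is, via the reflexive transitive closure of IS-A, a specialization of the entity attached to that role.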

The semantics of an ER schema can be given by specifying database states 
consistent with the information structure expressed by the schema (see [4]). 



2.2 OWL DL Language and Ontology 

OWL DL has two types of syntactic form. One is the exchange syntax[5], i.e., the
RDF/XML syntax, which represents an ontology as a set of RDF triples for the
purpose of publishing and sharing the ontology over the Web. The other form is the
frame-like abstract syntax[13], where a collection of information about a class
or property is given in one large syntactic construct, instead of being divided into a
number of atomic chunks (as in most DLs) or even being divided into more triples, as
when using the exchange syntax. The abstract syntax is abstracted from the exchange
syntax and thus facilitates access to and evaluation of the ontologies, which is the
reason we present our formal approach using this syntax in the paper.

OWL uses a DL-style model theory to formalize the meaning of the language[13].
The underlying formal foundation of OWL DL is the DL SHOIN(D), and the
semantics for OWL DL is based on interpretations[11], where an interpretation consists
of a domain of discourse and an interpretation function. The domain is divided
into two disjoint sets, the individual domain $\Delta^I$ and the data-value domain $\Delta^I_D$. The
interpretation function $I$ maps classes into subsets of $\Delta^I$, individuals into elements of
$\Delta^I$, datatypes into subsets of $\Delta^I_D$, and data values into elements of $\Delta^I_D$. In addition,
two disjoint sets of properties are distinguished: object properties and datatype properties.
The interpretation function maps the former into subsets of $\Delta^I \times \Delta^I$ and the
latter into subsets of $\Delta^I \times \Delta^I_D$. The interpretation function is extended to concept
expressions in the language.

Information in OWL is gathered into ontologies, which can then be stored as
documents in the RDF/XML syntax on the Web. The document consists of an optional
ontology header plus any number of class axioms, property axioms, and individual
axioms (i.e., facts about individuals). An OWL DL ontology in the abstract
syntax starts with an optional ontology ID and simply contains a sequence of (optional)
annotations and axioms. Regardless of which syntactic form is used, the formal
meaning of the ontology is solely determined by the underlying RDF graph of the
ontology, which is interpreted by the model-theoretic semantics for the language[13].

2.3 Mapping from ER Schemata to OWL DL Ontologies 

The formal approach for translating an ER schema into an OWL DL ontology follows 
a set of mapping rules, as specified in Definition 3. In fact, these mapping rules are 
induced by the semantic-preserving mapping rules from an ER schema to a DL 
knowledge base introduced in [4] and the semantic correspondence between an
OWL DL ontology and a DL knowledge base[11]. Therefore, the translation preserves
the semantics of the ER schema.

Definition 3. Let $S = (\mathcal{L}_S, isa_S, att_S, rel_S, card_S)$ be an ER schema. The OWL DL ontology
$O$ in the abstract syntax is defined by a translation function $\varphi(S) = (ID_0, axiom_0)$,
where

• $ID_0$ is a finite OWL DL identifier set partitioned into a subset $CID_0$ of class identifiers,
a subset $DPID_0$ of data-valued property identifiers, a subset $IPID_0$ of individual-valued
property identifiers, and a subset $DTID_0$ of datatype identifiers; each
datatype identifier is a predefined XML Schema datatype^4 identifier (that is used in
OWL as a local name in the XML Schema canonical URI reference for the
datatype), and

• $axiom_0$ is a finite OWL DL axiom set partitioned into a subset $caxiom_0$ of class axioms
and a subset $paxiom_0$ of property axioms, and

• $ID_0$ and $axiom_0$ are induced by the elements of $S$, following the mapping rules from
an ER schema to an OWL DL ontology as depicted in Figure 1.

In the table of Figure 1, the left column is the source of the mapping and the right column
is the target. Starting with the construction of the atomic identifiers (i.e., local
names in URI references in the RDF/XML syntax), the approach induces a set of axioms
from the ER schema, which forms the body of the target ontology. (In practice, multiple
class axioms for a particular class can be merged for conciseness.) The
ontology can then be evaluated by the domain expert.
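To make the flavor of the mapping concrete, the sketch below generates abstract-syntax axioms for mapping rules (1) to (4) of Figure 1 from dictionary-encoded schema components. It is an illustrative assumption rather than the ER2WO implementation: the function name `translate`, the input encoding, the identity renaming of symbols, and the attribute types in the example are all hypothetical.

```python
def translate(att, rel, isa, key_attrs=frozenset()):
    """Emit OWL DL abstract-syntax axioms (as strings) for rules (1)-(4).

    att: entity -> {attribute: datatype}; rel: relationship -> {role: entity};
    isa: set of (sub_entity, super_entity); key_attrs: set of (entity, attribute).
    Identifiers are reused unchanged, i.e. the renaming is the identity.
    """
    axioms = []
    # Rule (1): one DatatypeProperty per attribute, Functional for key attributes.
    for entity, attrs in att.items():
        for a, d in attrs.items():
            functional = " Functional" if (entity, a) in key_attrs else ""
            axioms.append(
                f"DatatypeProperty({a} domain({entity}) range({d}){functional})")
    # Rule (2): one ObjectProperty per ER-role, from the relationship to the entity.
    for r, roles in rel.items():
        for u, e in roles.items():
            axioms.append(f"ObjectProperty({u} domain({r}) range({e}))")
    # Rule (3): IS-A becomes a partial class axiom.
    for sub, sup in sorted(isa):
        axioms.append(f"Class({sub} partial {sup})")
    # Rule (4): each attribute is single-valued and mandatory on its entity.
    for entity, attrs in att.items():
        if attrs:
            restrictions = " ".join(
                f"restriction({a} allValuesFrom({d}) cardinality(1))"
                for a, d in attrs.items())
            axioms.append(f"Class({entity} partial {restrictions})")
    return axioms

# Hypothetical input; the datatypes chosen here are assumptions.
axioms = translate(
    att={"Student": {"studentNo": "xsd:integer", "studentName": "xsd:string"}},
    rel={"Attend": {"admitedBy": "Student", "admitTo": "Department"}},
    isa={("Faculty", "Employee")},
    key_attrs={("Student", "studentNo")},
)
```

Rules (5) to (10), which add inverse properties, cardinality restrictions and disjointness, would follow the same pattern of iterating over the schema components and emitting one axiom per match.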



2.4 Ontology Transform from Abstract Syntax to RDF/XML Syntax 

For the real use of the resulting ontology over the Web, it should be transformed into
the exchange syntax. The W3C has specified a set of semantics-preserving mapping
rules from OWL DL abstract syntax to RDF/XML syntax in the OWL Semantics and
Abstract Syntax document[13]. In the document, a model-theoretic semantics is given
to provide a formal meaning for OWL ontologies written in the abstract syntax. A
model-theoretic semantics in the form of an extension to the RDF semantics is also
specified to provide a formal meaning for OWL ontologies as RDF graphs. A mapping
from the abstract syntax to RDF graphs is then defined in the document, and
the two model theories are shown to have the same consequences
on OWL ontologies that can be written in the abstract syntax. We have simply
adopted these mapping rules to perform the syntax transformation of the ontology.

^4 XML Schema datatypes (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/) are
used in OWL as built-in datatypes by means of the XML Schema canonical URI reference
for the type, e.g., xsd:string, xsd:integer, and xsd:nonNegativeInteger.

ER Schema Elements  ->  OWL DL Ontology Elements (in the Abstract Syntax)

Alphabet $\mathcal{L}_S$  ->  Identifier Set $ID_0$ (symbols E, R, A, and U can be renamed optionally)

  Each entity symbol $E \in \mathcal{E}_S$  ->  a description (exactly, class identifier) $\varphi(E) \in CID_0$
  Each relationship symbol $R \in \mathcal{R}_S$  ->  a description (exactly, class identifier) $\varphi(R) \in CID_0$
  Each attribute symbol $A \in \mathcal{A}_S$  ->  a data-valued property identifier $\varphi(A) \in DPID_0$
  Each domain symbol $D \in \mathcal{D}_S$  ->  an (XML Schema) datatype identifier $\varphi(D) \in DTID_0$
  Each ER-role symbol $U \in \mathcal{U}_S$  ->  an individual-valued property identifier $\varphi(U) \in IPID_0$

Other Constructors  ->  Axiom Set $axiom_0$

  (1) Each attribute $A \in \mathcal{A}_S$ such that $att_S(E) = [\ldots, A : D, \ldots]$
      ->  Create a property axiom:
          DatatypeProperty($\varphi(A)$ domain($\varphi(E)$) range($\varphi(D)$) [Functional]),
          where Functional occurs only if $A$ is a key attribute as defined above.

  (2) Each ER-role $U \in \mathcal{U}_S$ such that $rel_S(R) = [\ldots, U : E, \ldots]$
      ->  Create a property axiom:
          ObjectProperty($\varphi(U)$ domain($\varphi(R)$) range($\varphi(E)$))

  (3) Each pair of entities $E_1, E_2 \in \mathcal{E}_S$ such that $E_1 \; isa_S \; E_2$
      ->  Create a class axiom:
          subClassOf($\varphi(E_1)$ $\varphi(E_2)$) or Class($\varphi(E_1)$ partial $\varphi(E_2)$)

  (4) Each entity $E \in \mathcal{E}_S$ such that $att_S(E) = [A_1 : D_1, \ldots, A_h : D_h]$
      ->  Create a class axiom:
          Class($\varphi(E)$ partial restriction($\varphi(A_1)$ allValuesFrom($\varphi(D_1)$) cardinality(1)) ...
          restriction($\varphi(A_h)$ allValuesFrom($\varphi(D_h)$) cardinality(1)))

  (5)-(7) Each relationship $R \in \mathcal{R}_S$ such that $rel_S(R) = [U_1 : E_1, \ldots, U_k : E_k]$
      ->  (5) Create a class axiom:
          Class($\varphi(R)$ partial restriction($\varphi(U_1)$ allValuesFrom($\varphi(E_1)$) cardinality(1)) ...
          restriction($\varphi(U_k)$ allValuesFrom($\varphi(E_k)$) cardinality(1))), and
          for $i = 1, \ldots, k$ do:
          create a property identifier $P_i \in IPID_0$;
          (6) create a property axiom:
          ObjectProperty($P_i$ domain($\varphi(E_i)$) range($\varphi(R)$) inverseOf($\varphi(U_i)$));
          (7) create a class axiom:
          Class($\varphi(E_i)$ partial restriction($P_i$ allValuesFrom($\varphi(R)$)))

  (8)-(9) Each relationship $R \in \mathcal{R}_S$ such that $rel_S(R) = [U_1 : E_1, \ldots, U_k : E_k]$,
      for $i = 1, \ldots, k$, and for each entity $E \in \mathcal{E}_S$ such that $E \; isa_S^* \; E_i$
      ->  For $i = 1, \ldots, k$ do (assume $P_i$ has been created with axiom (6)):
          (8) if $m = min\_card_S(E, R, U_i) \neq 0$ then create a class axiom:
          Class($\varphi(E)$ partial restriction($P_i$ minCardinality($m$)));
          (9) if $n = max\_card_S(E, R, U_i) \neq \infty$ then create a class axiom:
          Class($\varphi(E)$ partial restriction($P_i$ maxCardinality($n$)))

  (10) For each pair of symbols $X, Y \in \mathcal{E}_S \cup \mathcal{R}_S$ such that $X \neq Y$ and $X \in \mathcal{R}_S$
      ->  Create a class axiom:
          DisjointClasses($\varphi(X)$ $\varphi(Y)$)

Fig. 1. Mapping rules from an ER schema to an OWL DL ontology in the abstract syntax



3 Prototype Tool and Case Study 

Based on the formal approach introduced above, we developed an automated tool,
ER2WO, which can read in an XML-coded ER schema (currently, a conceptual data
model (CDM) file produced by the CASE tool PowerDesigner 9.5^5), translate
the schema automatically into an OWL DL ontology using the mapping rules, and
produce the resulting ontology in both the abstract syntax and the RDF/XML syntax.



3.1 Design and Implementation of the Tool 

ER2WO consists of four modules: a parsing module which parses an
XML-coded ER schema, a translation module that translates the parsed schema into the
ontology in OWL DL abstract syntax, a transformation module that performs the ontology
transformation from the abstract syntax to the RDF/XML syntax, and an output
module which produces the resulting ontology as a text file.

The tool implementation is based on the Java 2 v1.4.2 platform. The parsing module
uses the SAX API for Java to parse the ER schema file and store the schema data in
Java ArrayList objects. The translation module uses Java class methods to implement
the mapping from the schema to the OWL DL ontology in the abstract syntax. To the
best of our knowledge, no off-the-shelf tool has so far existed for performing
the ontology transformation from the abstract syntax to the RDF/XML syntax. We therefore
developed the transformation module using the XML presentation syntax for
OWL specified in the document[10] as an intermediate format between the two syntax
formats, and using the XSLT stylesheet (owlxml2rdf.xsl^6) provided in the
document for the transformation from the XML presentation syntax to the RDF/XML
syntax. A screen snapshot of ER2WO v1.0 is shown in Figure 2. In the figure, the
list boxes on the left display the parsed ER schema univ_dept from the case study below, and
the text area on the right and the pop-up window display the resulting ontology in the abstract
syntax and in the RDF/XML syntax, respectively.
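The event-driven parsing step can be sketched as follows. The tool itself uses the SAX API for Java on PowerDesigner CDM files; the sketch below uses Python's standard `xml.sax` module instead, and the XML element and attribute names (`schema`, `entity`, `attribute`, `name`, `type`) are hypothetical stand-ins, not the actual CDM file format.

```python
import xml.sax

class ERSchemaHandler(xml.sax.ContentHandler):
    """Collects entities and their attributes from a stream of SAX events."""

    def __init__(self):
        super().__init__()
        self.entities = []     # entity names in document order
        self.attributes = {}   # entity name -> list of (attribute, type)
        self._current = None   # entity currently being parsed, if any

    def startElement(self, name, attrs):
        if name == "entity":
            self._current = attrs["name"]
            self.entities.append(self._current)
            self.attributes[self._current] = []
        elif name == "attribute" and self._current is not None:
            self.attributes[self._current].append((attrs["name"], attrs["type"]))

    def endElement(self, name):
        if name == "entity":
            self._current = None

# Hypothetical XML encoding of a tiny schema, used only for this example.
SAMPLE = b"""<schema name="univ_dept">
  <entity name="Student">
    <attribute name="studentNo" type="I"/>
    <attribute name="studentName" type="TXT"/>
  </entity>
  <entity name="Department">
    <attribute name="deptNo" type="I"/>
  </entity>
</schema>"""

def parse_schema(xml_bytes):
    """Parse an XML-coded schema and return the populated handler."""
    handler = ERSchemaHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler
```

Because SAX delivers elements as a stream of events, the handler keeps only a small amount of state (the entity currently open) and appends to plain lists, which is comparable in spirit to storing the parsed schema in list structures.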



^5 http://www.sybase.com/products/enterprisemodeling/powerdesigner.
^6 http://www.w3.org/TR/owl-xmlsyntax/owlxml2rdf.xsl.

Fig. 2. Screen snapshot of ER2WO



3.2 Case Study 

We have carried out more than 10 case studies using ER2WO, with the scale of the
ER schema ranging from 57 to 356^7. The case studies show that the formal approach is
feasible and the implemented tool is efficient, even for large-scale ER schemata. All
resulting OWL DL ontologies have passed the syntactic validation by the OWL Ontology
Validator (http://phoebus.cs.man.ac.uk:9999/OWL/Validator). Four selected
cases, University Department, Project Management, Book Store and Library
Management, have been published at the homepage of the ER2WO tool^8.

To save space, Figure 3 shows a small-scale example ER schema, univ_dept,
which models a university department with PowerDesigner 9.5. Note that
PowerDesigner does not currently support N-ary relationships (N > 2), or attributes on
relationships. Hence, the case does not include these ER features. In the figure, 'TXT'
denotes datatype string, 'I' denotes integer, and '1,n' denotes that the minimum
cardinality is 1 and the maximum cardinality is $\infty$.

^7 Here we measure the scale of an ER schema by summing up all the numbers of elements in
the schema, including entities, (IS-A) relationships, attributes, ER-roles and constraints.

^8 http://cse.seu.edu.cn/people/ysdong/graduates/--zmxu/ER2WO/.




Fig. 3. An example ER schema univ_dept modeled with PowerDesigner 9.5 



Following the mapping rules, the ER schema is translated into the OWL DL ontology
$\varphi$(univ_dept) = ($ID_0$, $axiom_0$), where $ID_0$ is an OWL DL identifier set partitioned
into:
$CID_0$ = {Course, Employee, Student, Department, Faculty, Teach,
Attend, Enrol, Offer, Work};
$DTID_0$ = {xsd:string, xsd:integer};
$DPID_0$ = {studentNo, studentName, deptNo, deptName, courseID,
credit, empID, empName, title};
$IPID_0$ = {admitTo, admittedBy, enrolOf, enrolledIn, offererOf, offeredBy,
workFor, employerOf, teacherOf, taughtBy, inv_admitTo, inv_admittedBy, inv_enrolOf,
inv_enrolledIn, inv_offererOf, inv_offeredBy, inv_workFor,
inv_employerOf, inv_teacherOf, inv_taughtBy}, where identifiers with
prefix inv_ denote the inverse of the corresponding property identifiers (e.g.,
inv_admitTo is the inverse of property identifier admitTo).

Based on the identifier set, the OWL DL axiom set $axiom_0$, which forms the body of
ontology $\varphi$(univ_dept), is presented in Appendix A. (It can also be found at the
homepage of the ER2WO tool.)



4 Related Work 

Two categories of approaches or tools are related to our work. One is the DL-based
conceptual modeling approaches such as [4], and tools such as I.COM[7]. These early
approaches and tools established the correspondence between DLs and the ER model,
which forms the formal foundation for our approach and tool. They focus on reasoning
about the schema by translating it into a DL knowledge base, whereas our tool aims
at acquiring knowledge from existing ER schemata and then building and publishing
OWL ontologies on the Web.

The other category is the tools for ontology learning and instance-data migration
from databases. According to recent surveys [2] and [8], there are two ongoing projects
that aim at developing tools for creating lightweight Web ontologies and ontological
instances from databases, i.e., the D2R MAP^9 and KAON REVERSE[14]^10 prototype
tools. D2R MAP is a declarative language to describe mappings between relational
database schemata and RDF(S) ontologies. The mappings can be used by a D2R
processor to export data from a database to an RDF file. However, the mapping definition
process is manual and requires domain-knowledge input from the modeler. Besides,
D2R MAP focuses on exposing an RDF description of the relational database,
not the conceptual entities which the relational description is attempting to capture[2].
KAON REVERSE is an early prototype for mapping relational database content to
RDF(S) ontologies and instances. It is intended to be merged with the Harmonise
Mapping Framework tool^11 and the user interface into the KAON OIModeler. The
early prototype[14] takes the schema and instance of a domain-specific relational database
as the source, uses F-Logic predicates and axioms as the intermediate format,
and adopts data reverse engineering and mapping approaches to produce the RDF(S)
ontology and instance data. However, it also needs some degree of human participation
and manual checking, and because RDF(S) does not have enough expressive power
to capture the knowledge and constraints of the domain, the tool has inherent drawbacks[2].
Our work is an attempt toward developing more formal
methods and automated tools that perform the ontology learning process and support
more expressive languages such as OWL.



5 Conclusions 

The real power of the Semantic Web will be realized only when people create much 
machine-readable content, and ontologies play a key role in this effort. Given the fact 
that many industrial CASE tools can support both database forward and reverse engi- 
neering and produce ER schemata in exchangeable (e.g., in XML) format, extracting 
knowledge from database schemata and producing OWL ontologies by integrating 
our ER2WO tool with existing CASE tools can thus profitably support the develop- 
ment and reuse of Web ontologies, and then facilitate the semantic markup of dy- 
namic Web content generated from databases. 

In a broad sense, the Semantic Web needs to be able to share and reuse existing
knowledge bases. Therefore, interoperability among different knowledge representation
systems is essential. In this sense, our ER2WO tool can act as a bridge
between existing database applications or legacy systems and the Semantic Web.

* http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rmap/D2Rmap.htm
** http://kaon.semanticweb.org/alphaworld/reverse/view
*** http://sourceforge.net/projects/hmafra/

Appendix A OWL DL ontology derived from the ER schema univ_dept
(For brevity, the disjoint class axioms and annotations are omitted here.)



Ontology(univ_dept
  Class(Attend partial restriction(admitTo allValuesFrom(Department) cardinality(1))
    restriction(admitedBy allValuesFrom(Student) cardinality(1)))
  ObjectProperty(admitedBy domain(Attend) range(Student))
  ObjectProperty(inv_admitedBy domain(Student) range(Attend) inverseOf(admitedBy))
  ObjectProperty(admitTo domain(Attend) range(Department))
  ObjectProperty(inv_admitTo domain(Department) range(Attend) inverseOf(admitTo))

  Class(Work partial restriction(workFor allValuesFrom(Employee) cardinality(1))
    restriction(employerOf allValuesFrom(Department) cardinality(1)))
  ObjectProperty(employerOf domain(Work) range(Department))
  ObjectProperty(inv_employerOf domain(Department) range(Work) inverseOf(employerOf))
  ObjectProperty(workFor domain(Work) range(Employee))
  ObjectProperty(inv_workFor domain(Employee) range(Work) inverseOf(workFor))

  Class(Offer partial restriction(offererOf allValuesFrom(Department) cardinality(1))
    restriction(offeredBy allValuesFrom(Course) cardinality(1)))
  ObjectProperty(offeredBy domain(Offer) range(Course))
  ObjectProperty(inv_offeredBy domain(Course) range(Offer) inverseOf(offeredBy))
  ObjectProperty(offererOf domain(Offer) range(Department))
  ObjectProperty(inv_offererOf domain(Department) range(Offer) inverseOf(offererOf))

  Class(Enrol partial restriction(enrolOf allValuesFrom(Student) cardinality(1))
    restriction(enrolledIn allValuesFrom(Course) cardinality(1)))
  ObjectProperty(enrolledIn domain(Enrol) range(Course))
  ObjectProperty(inv_enrolledIn domain(Course) range(Enrol) inverseOf(enrolledIn))
  ObjectProperty(enrolOf domain(Enrol) range(Student))
  ObjectProperty(inv_enrolOf domain(Student) range(Enrol) inverseOf(enrolOf))

  Class(Teach partial restriction(teacherOf allValuesFrom(Faculty) cardinality(1))
    restriction(taughtBy allValuesFrom(Course) cardinality(1)))
  ObjectProperty(taughtBy domain(Teach) range(Course))
  ObjectProperty(inv_taughtBy domain(Course) range(Teach) inverseOf(taughtBy))
  ObjectProperty(teacherOf domain(Teach) range(Faculty))
  ObjectProperty(inv_teacherOf domain(Faculty) range(Teach) inverseOf(teacherOf))

  Class(Student partial restriction(studentNo allValuesFrom(xsd:string) cardinality(1))
    restriction(studentName allValuesFrom(xsd:string) cardinality(1))
    restriction(inv_admitedBy allValuesFrom(Attend) cardinality(1))
    restriction(inv_enrolOf allValuesFrom(Enrol) maxCardinality(30)))

  Class(Department partial restriction(deptNo allValuesFrom(xsd:string) cardinality(1))
    restriction(deptName allValuesFrom(xsd:string) cardinality(1))
    restriction(inv_admitTo allValuesFrom(Attend) minCardinality(1))
    restriction(inv_employerOf allValuesFrom(Work) minCardinality(1))
    restriction(inv_offererOf allValuesFrom(Offer) minCardinality(1)))

  Class(Course partial restriction(courseID allValuesFrom(xsd:string) cardinality(1))
    restriction(credit allValuesFrom(xsd:integer) cardinality(1))
    restriction(inv_offeredBy allValuesFrom(Offer) cardinality(1))
    restriction(inv_enrolledIn allValuesFrom(Enrol) minCardinality(10)
      maxCardinality(100))
    restriction(inv_taughtBy allValuesFrom(Teach) minCardinality(1)))

  Class(Faculty partial Employee restriction(title allValuesFrom(xsd:string) cardinality(1))
    restriction(inv_teacherOf allValuesFrom(Teach) minCardinality(1)))

  Class(Employee partial restriction(empID allValuesFrom(xsd:integer) cardinality(1))
    restriction(empName allValuesFrom(xsd:string) cardinality(1))
    restriction(inv_workFor allValuesFrom(Work) cardinality(1)))

  DatatypeProperty(studentNo domain(Student) range(xsd:string) Functional)
  DatatypeProperty(studentName domain(Student) range(xsd:string))
  DatatypeProperty(deptNo domain(Department) range(xsd:string) Functional)
  DatatypeProperty(deptName domain(Department) range(xsd:string))
  DatatypeProperty(courseID domain(Course) range(xsd:string) Functional)
  DatatypeProperty(credit domain(Course) range(xsd:integer))
  DatatypeProperty(title domain(Faculty) range(xsd:string))
  DatatypeProperty(empID domain(Employee) range(xsd:integer) Functional)
  DatatypeProperty(empName domain(Employee) range(xsd:string)))
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Abstract. Condensed representations of pattern collections have been 
recognized to be important building blocks of inductive databases, a 
promising theoretical framework for data mining, and recently they have 
been studied actively. However, there has not been much research on how 
condensed representations should actually be represented. 

In this paper we propose a general approach to build condensed representations
of pattern collections. The approach is based on separating
the structure of the pattern collection from the interestingness values of
the patterns. We also study the concrete case of representing the frequent
sets and their (approximate) frequencies following this approach:
we discuss the trade-offs in representing the frequent sets by the maximal
frequent sets, the minimal infrequent sets and their combinations, and
investigate the problem of approximating the frequencies from samples by
giving new upper bounds on sample complexity based on frequent closed
sets and describing how convex optimization can be used to improve and
score the obtained samples.



1 Introduction 

Data mining aims to find something interesting from large databases. One of 
the most important approaches to mine data is pattern discovery where the 
goal is to extract interesting patterns (possibly with some interestingness values 
associated to each of them) from data [1,2]. The most prominent example of 
pattern discovery is the frequent set mining problem: 

Problem 1 (Frequent set mining). Given a multiset $d = \{d_1, \ldots, d_n\}$ (a data
set) of subsets (transactions) of a set $R$ of attributes and a threshold value
$\sigma \in [0, 1]$, find the collection $\mathcal{F}(\sigma, d) = \{X \subseteq R : fr(X, d) \ge \sigma\}$ where
$fr(X, d) = |cover(X, d)| / n$ and $cover(X, d) = \{i : X \subseteq d_i, 1 \le i \le n\}$.

The set collection $\mathcal{F}(\sigma, d)$ is called the collection of $\sigma$-frequent sets in $d$.
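The definitions in Problem 1 can be illustrated with a small brute-force sketch. It enumerates every subset of the attributes, so it is only meant for toy inputs and is not one of the efficient techniques cited in the text; the data set `d` and the function names are hypothetical, and transactions are indexed from 0 rather than from 1.

```python
from itertools import combinations

def cover(X, d):
    """Indices of the transactions in d that contain every attribute of X."""
    return {i for i, t in enumerate(d) if X <= t}

def fr(X, d):
    """Frequency of X in d: the fraction of transactions containing X."""
    return len(cover(X, d)) / len(d)

def frequent_sets(d, sigma):
    """Brute-force collection of sigma-frequent sets, mapped to their frequencies."""
    R = sorted(set().union(*d))          # the attribute set
    result = {}
    for k in range(len(R) + 1):
        for combo in combinations(R, k):
            X = frozenset(combo)
            f = fr(X, d)
            if f >= sigma:
                result[X] = f
    return result

# Hypothetical toy data set: four transactions over the attributes a, b, c.
d = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
F = frequent_sets(d, 0.5)
for X, f in sorted(F.items(), key=lambda item: (len(item[0]), sorted(item[0]))):
    print(sorted(X), f)
```

On this toy input the 0.5-frequent sets are the empty set, the three single attributes, and the pairs {a, b} and {a, c}.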

There exist techniques to efficiently compute frequent sets, see e.g. [3]. A major
advantage of frequent sets is that they can be computed from data without
much domain knowledge: the data set determines an empirical joint probability
distribution over the attribute combinations, and high marginal probabilities
of the joint probability distribution can be considered a reasonable way to
summarize the joint probability distribution. (Note that a sample from the
joint probability distribution is also quite a good summary.) However, this generality
causes a major problem of frequent sets: the frequent set collections that describe
data well tend to be large. Although the computations could be done efficiently
enough, it is not certain that the huge collection of frequent sets is a very concise
summary of the data.

Attempts to solve this problem of overly large frequent set collections compute
a small irredundant subcollection of the given frequent set collection
such that the subcollection determines the frequent set collection completely.
Such subcollections are usually called condensed representations of the frequent
set collection [4]. The condensed representations of the frequent sets have been
recognized to have an important role in inductive databases, which seem to be
a promising theoretical framework for data mining [5,6,7,8].

The condensed representations of frequent sets have been studied actively
lately and several condensed representations, such as maximal sets [9], closed
sets [10], free sets [11], disjunction-free sets [12], disjunction-free generators [13],
non-derivable itemsets [14], condensed pattern bases [15], pattern orderings [16]
and pattern chains [17], have been proposed. However, not much has been done
on how the condensed representations should actually be represented, although
it is an important question both for the computational efficiency and for the
effectiveness of the data analyst.

In this paper we investigate how the patterns and their interestingness values 
can be represented separately. In particular, we study how to represent frequent 
sets and their frequencies: we discuss how the collection of frequent sets can be 
described concisely by combinations of its maximal frequent and minimal infre- 
quent sets, show that already reasonably small samples determine the frequencies 
of the frequent sets accurately and describe how a weighting of the sample can 
further improve the frequency estimates. 

The paper is organized as follows. In Section 2 we argue why describing 
patterns and their interestingness values separately makes sense. In Section 3 
we study a representation of interestingness values based on random samples of 
data and give sample complexity bounds that are sometimes considerably better 
than the bounds given in [18]. In Section 4 we describe how the samples can 
be weighted to optimally approximate the frequencies of the set collection w.r.t. 
a wide variety of loss functions and show experimentally that a considerable 
decrease of loss can be achieved. The work is concluded in Section 5. 

2 Separating Patterns and Interestingness 

Virtually all condensed representations of pattern collections represent the collection
by listing a subcollection of the interesting patterns. This approach to
condensed representations has the desirable closure property that a subcollection
of (irredundant) patterns is also a collection of patterns. Another advantage
of many condensed representations of pattern collections consisting of a list of
irredundant patterns is that the patterns in the original collection and their
interestingness values can be inferred conceptually very easily from the subcollection
of irredundant patterns and their interestingness values.

The most well-known examples of condensed representations of pattern collections
are the collections of closed $\sigma$-frequent sets:

Definition 1 (Closed $\sigma$-frequent sets). A $\sigma$-frequent set $X \in \mathcal{F}(\sigma, d)$ is
closed if and only if $fr(X, d) > fr(X \cup \{A\}, d)$ for all $A \in R \setminus X$.

The collection of closed $\sigma$-frequent sets is denoted by $\mathcal{C}(\sigma, d)$.

The collection $\mathcal{C}(\sigma, d)$ consists of closures of the sets in $\mathcal{F}(\sigma, d)$, i.e., the sets
$X \in \mathcal{F}(\sigma, d)$ such that $X = cl(X, d)$ where

$cl(X, d) = \arg\max \{fr(Y, d) : Y \supseteq X, Y \in \mathcal{C}(\sigma, d)\}$.

Clearly, the frequency of the set $X \in \mathcal{F}(\sigma, d)$ is the maximum of the frequencies
of the closed frequent supersets of $X$, i.e., the frequency of its closure

$fr(X, d) = fr(cl(X, d), d) = \max \{fr(Y, d) : Y \supseteq X, Y \in \mathcal{C}(\sigma, d)\}$.
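A minimal sketch of the closure operation follows. It relies on the standard characterization that the closure of a set is the intersection of all transactions containing it, which is an assumption here and not the arg max formulation used in the text; the toy data set and function names are hypothetical.

```python
def fr(X, d):
    """Frequency of X in d: the fraction of transactions containing X."""
    return sum(1 for t in d if X <= t) / len(d)

def closure(X, d):
    """Largest superset of X contained in exactly the transactions that contain X."""
    containing = [t for t in d if X <= t]
    if not containing:
        return None                      # X occurs nowhere; no closure in this sketch
    return frozenset.intersection(*containing)

def is_closed(X, d):
    """A set is closed when it equals its own closure."""
    return closure(X, d) == X

# Hypothetical toy data set: four transactions over the attributes a, b, c.
d = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
c = frozenset("c")
print(sorted(closure(c, d)))             # every transaction with c also has a
print(fr(c, d) == fr(closure(c, d), d))  # a set and its closure share a frequency
```

Here {c} is not closed because adding a does not lower its frequency, whereas {a} is closed.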

Although there are many positive aspects in representing the pattern collection
and its interestingness values by an irredundant subset of the pattern
collection, there are some benefits achievable by separating the structure of the
collection from the interestingness values. For example, the patterns alone can
always be represented at least as compactly as (and most of the time much more
compactly than) the same patterns with their interestingness values. As a concrete
example, the collection of frequent sets and the collection of closed frequent
sets can be represented by their subcollection of maximal frequent sets:

Definition 2 (Maximal $\sigma$-frequent sets). A $\sigma$-frequent set $X \in \mathcal{F}(\sigma, d)$ is
maximal if and only if $Y \supseteq X, Y \in \mathcal{F}(\sigma, d) \Rightarrow X = Y$.

The collection of maximal $\sigma$-frequent sets is denoted by $\mathcal{M}(\sigma, d)$.

Clearly, the frequent sets can be determined using the maximal frequent sets:
a set $X \subseteq R$ is in the collection $\mathcal{F}(\sigma, d)$ if and only if it is a subset of some set
in $\mathcal{M}(\sigma, d)$.

The collection of maximal frequent sets represents the collection of (closed)
frequent sets quite compactly since $|\mathcal{M}(\sigma, d)|$ is never larger than $|\mathcal{C}(\sigma, d)|$ but
it can be exponentially smaller than the corresponding closed frequent set collection
(and thus the frequent set collection, too): Let the data set $d$ consist
of the sets $R \setminus \{A\}$ s.t. $A \in R$. Then the collection of (closed) $(1/n)$-frequent
sets consists of all subsets of $R$ except $R$ itself but the collection
of maximal frequent sets consists only of the sets $R \setminus \{A\}$, $A \in R$. Then
$|\mathcal{F}(\sigma, d)| / |\mathcal{M}(\sigma, d)| = |\mathcal{C}(\sigma, d)| / |\mathcal{M}(\sigma, d)| = (2^{|R|} - 1) / |R|$. Also, the only case when
the number of maximal frequent sets is equal to the number of frequent sets is
when the collection of frequent sets consists solely of singleton sets, i.e., frequent
items.
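The example data set above can be checked by brute force for a small attribute set. The following sketch assumes a hypothetical attribute set with |R| = 5 and uses exhaustive enumeration, so it is only an illustration of the definitions.

```python
from itertools import combinations

def frequent_sets(d, sigma):
    """Brute-force collection of all subsets of R with frequency at least sigma."""
    R = sorted(set().union(*d))
    n = len(d)
    return {
        frozenset(c)
        for k in range(len(R) + 1)
        for c in combinations(R, k)
        if sum(1 for t in d if frozenset(c) <= t) / n >= sigma
    }

def maximal_sets(F):
    """Sets of F that have no proper superset in F."""
    return {X for X in F if not any(X < Y for Y in F)}

# The data set from the text: one transaction R \ {A} for each attribute A.
R = frozenset(range(5))                  # hypothetical attribute set, |R| = 5
d = [R - {A} for A in R]
F = frequent_sets(d, 1 / len(d))
M = maximal_sets(F)
print(len(F), len(M))                    # all proper subsets of R versus |R| sets
```

For |R| = 5 there are 31 frequent sets (every subset of R except R itself) but only 5 maximal frequent sets.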

In addition to the maximal frequent sets, the frequent sets can also be described
by the minimal infrequent sets:

Definition 3 (Minimal $\sigma$-infrequent sets). A $\sigma$-infrequent set $X \in 2^R \setminus \mathcal{F}(\sigma, d)$
is minimal if and only if $Y \subseteq X, Y \in 2^R \setminus \mathcal{F}(\sigma, d) \Rightarrow X = Y$.

The collection of minimal $\sigma$-infrequent sets is denoted by $\mathcal{I}(\sigma, d)$.

Again, obtaining the frequent sets from this representation is straightforward:
a set $X \subseteq R$ is $\sigma$-frequent if and only if it does not contain any set $Y \in \mathcal{I}(\sigma, d)$.

Minimal infrequent sets for a given collection of frequent sets can be computed
quite efficiently since the minimal infrequent sets are the minimal transversals
of the complements of the maximal frequent sets [9].
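As an illustration of Definition 3, the sketch below finds the minimal infrequent sets by exhaustive enumeration; it does not implement the transversal-based computation mentioned above. It uses the fact that frequency cannot increase when a set grows, so checking the subsets with one attribute removed suffices. The toy data set and function names are hypothetical.

```python
from itertools import combinations

def fr(X, d):
    """Frequency of X in d: the fraction of transactions containing X."""
    return sum(1 for t in d if X <= t) / len(d)

def minimal_infrequent_sets(d, sigma):
    """Infrequent sets all of whose proper subsets are frequent (brute force)."""
    R = sorted(set().union(*d))
    result = set()
    for k in range(1, len(R) + 1):
        for combo in combinations(R, k):
            X = frozenset(combo)
            if fr(X, d) < sigma and all(fr(X - {a}, d) >= sigma for a in X):
                result.add(X)
    return result

def is_frequent(X, minimal_infrequent):
    """X (a subset of R) is frequent iff it contains no minimal infrequent set."""
    return not any(Y <= X for Y in minimal_infrequent)

# Hypothetical toy data set: four transactions over the attributes a, b, c.
d = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
I = minimal_infrequent_sets(d, 0.5)
print([sorted(X) for X in I])
```

On this input the only minimal 0.5-infrequent set is {b, c}, and membership in the frequent set collection can be decided from it alone.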

It is not immediate which of the representations (the maximal frequent
sets or the minimal infrequent sets) is smaller even in terms of the number
of sets: the number $|\mathcal{M}(\sigma, d)|$ of maximal $\sigma$-frequent sets is bounded by
$(|R| - \sigma n + 1) |\mathcal{I}(\sigma, d)|$ (unless the collection $\mathcal{I}(\sigma, d)$ of minimal $\sigma$-infrequent sets
is empty) but $|\mathcal{I}(\sigma, d)|$ cannot be bounded by a polynomial in $|\mathcal{M}(\sigma, d)|$, in $|R|$
and in $n$ (i.e., the number of transactions in $d$) [19].

In practice, the representation for a given collection of frequent sets can be 
chosen to be the one that is smaller for that particular collection. Furthermore, 
instead of choosing either the maximal frequent sets or the minimal infrequent 
sets, it is possible to choose a subcollection determining the collection of frequent 
sets uniquely that contains sets from both collections: 

Problem 2 (Smallest representation of frequent sets). Given collections $\mathcal{M}(\sigma, d)$
and $\mathcal{I}(\sigma, d)$, find the smallest subset $\mathcal{T}$ of $\mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)$ that uniquely
determines $\mathcal{F}(\sigma, d)$.

By definition, all proper subsets of minimal infrequent sets are frequent and all
proper supersets of maximal frequent sets are infrequent. Thus, Problem 2 can be
modeled as a minimum weight set cover problem:

Problem 3 (Minimum weight set cover [20]). Given a collection $\mathcal{S}$ of subsets of
a finite set $S$ and a weight function $w : \mathcal{S} \to \mathbb{R}$, find a subcollection $\mathcal{S}'$ of $\mathcal{S}$
that covers $S$ and has the smallest weight $w(\mathcal{S}') = \sum_{X \in \mathcal{S}'} w(X)$.

The minimum weight set cover problem is approximable within a factor
$1 + \ln |S|$ [20].
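A logarithmic approximation guarantee of this kind is what the classical greedy heuristic for weighted set cover provides. The sketch below shows that heuristic on a hypothetical instance; it is an illustration of the general technique, and the text does not state which algorithm [20] refers to.

```python
def greedy_set_cover(universe, subsets, weight):
    """Greedy weighted set cover.

    subsets maps a name to a set of elements, weight maps a name to a positive
    number. Repeatedly picks the subset with the smallest weight per newly
    covered element until the universe is covered.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (name for name, s in subsets.items() if s & uncovered),
            key=lambda name: weight[name] / len(subsets[name] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("the universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# Hypothetical instance with unit weights, as in the reduction of Problem 2.
universe = {1, 2, 3, 4, 5}
subsets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
weight = {name: 1 for name in subsets}
chosen = greedy_set_cover(universe, subsets, weight)
print(chosen)
```

With unit weights the heuristic simply picks the subset covering the most uncovered elements in each round.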

The set cover instance corresponding to the case of representing the frequent
sets by a subcollection of $\mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)$ is the following. The set $S$
is equal to $\mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)$. The set collection $\mathcal{S}$ consists of the set
$S_X = \{Y \in \mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d) : X \subseteq Y \lor X \supseteq Y\}$ for each $X \in \mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)$.
Clearly, the solution $\mathcal{T} \subseteq \mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)$ determines the collection $\mathcal{F}(\sigma, d)$
uniquely if and only if the solution $\mathcal{S}'$ for the corresponding set cover instance
covers $S$. Furthermore, $w(\mathcal{S}') = |\mathcal{T}|$ when $w(X) = 1$ for all $X \in \mathcal{S}$. Due to this
reduction and the approximability of Problem 3 we get the following result:

Theorem 1. Problem 2 is approximable within a factor
$1 + \ln |\mathcal{M}(\sigma, d) \cup \mathcal{I}(\sigma, d)|$.

A very interesting variant of Problem 2 is the case when the user can interactively
determine which attributes or frequent sets must be represented. This can
be modeled as a minimum set cover problem, too. In the case when the user only
adds attributes or frequent sets that must be represented, this problem can be
approximated almost as well as Problem 2 [21].

In addition to knowing which sets are frequent (or, in general, which patterns
are interesting), it is usually desirable to know also how frequent each of
the frequent sets is. One approach to describe the frequencies of the sets
(approximately) correctly is to construct a small representative data set $d'$ from the
original data set $d$ [22,23]. One computationally efficient and very flexible way
to do this is to obtain a random sample from the data set.

In the next two sections we study this approach. In Section 3 we give upper 
bounds on how many transactions chosen randomly from d suffice to give good 
approximations for the frequencies of all frequent sets simultaneously with high 
probability. The bounds can be computed from closed frequent sets which might 
be beneficial when each randomly chosen transaction is very expensive or the 
cost of the sample should be bounded above in advance. In Section 4 we show 
how the frequency estimates computed from the sample can be significantly 
improved by weighting the transactions using convex programming. 

3 Sample Complexity Bounds 

If the collection $\mathcal{F}(\sigma, d)$ is known then it can be shown by a simple application of
Chernoff bounds that for a sample $d'$ of at least $\ln(2 |\mathcal{F}| / \delta) / (2\varepsilon^2)$ transactions,
the absolute error of the frequency estimates for all sets in the collection is at
most $\varepsilon$ with probability at least $1 - \delta$ [18].
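The bound is easy to evaluate numerically. The helper below is a small sketch; the function name, the parameter names `eps` (the error bound) and `delta` (the allowed failure probability), and the input values are assumptions chosen for illustration.

```python
import math

def chernoff_sample_size(num_sets, eps, delta):
    """Smallest integer sample size satisfying ln(2 * num_sets / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2 * num_sets / delta) / (2 * eps ** 2))

# Hypothetical setting: 10000 frequent sets, error 0.01, failure probability 0.05.
print(chernoff_sample_size(10_000, 0.01, 0.05))
print(chernoff_sample_size(100, 0.01, 0.05))
```

The size grows only logarithmically with the number of sets but quadratically with 1/eps, which is why the later bounds focus on replacing the number of sets by a structural quantity.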

However, the bounds given in [18] can be improved significantly since the
frequent set collections have structure that can be useful when estimating the
sufficient size for the sample. If the set collection $\mathcal{F}$ is known, the data set $d$ itself
is usually a compact representation of the covers of the sets in $\mathcal{F}$. The goal of
sampling is to choose a small set of transactions that accurately determine the
sizes of the covers for each set in $\mathcal{F}$. This kind of sample (also for arbitrary set
collections in addition to the collections of covers) is called an $\varepsilon$-approximation
[24]. The definition of $\varepsilon$-approximation can be expressed for covers of a set
collection as follows:

Definition 4 ($\varepsilon$-approximation). A finite subset $T \subseteq [n] = \{1, \ldots, n\}$ is an
$\varepsilon$-approximation for the set collection $\mathcal{F}$ w.r.t. $d$ if we have, for all $X \in \mathcal{F}$,

$\left| \frac{|T \cap cover(X, d)|}{|T|} - \frac{|cover(X, d)|}{n} \right| \le \varepsilon$.

The sample complexity bound given in [18] is essentially optimal in the general
case. However, we can obtain considerably better bounds if we look at the
structure of the set collection. One such structural property is the
VC-dimension of the collection:

Definition 5 (VC-dimension). The VC-dimension $VC(cover(\mathcal{F}, d))$ of the
set collection $cover(\mathcal{F}, d) = \{cover(X, d) : X \in \mathcal{F}\}$ is

$VC(cover(\mathcal{F}, d)) = \max \{ |T| : |\{T \cap cover(X, d) : X \in \mathcal{F}\}| = 2^{|T|} \}$.

Given the VC-dimension of the collection of covers, the number of transactions
that form an $\varepsilon$-approximation for the covers can be bounded above by
the following lemma adapted from the corresponding result for arbitrary set
collections [24]:

Lemma 1 (VC-dimension bound). Let $cover(\mathcal{F}, d)$ be a set system of
VC-dimension at most $k$, and let $\varepsilon < 1/2$. Then there exists an $\varepsilon$-approximation for
the set system of size at most $O(k \log(1/\varepsilon)) / \varepsilon^2$.

Actually, in addition to mere existence of small $\varepsilon$-approximations, an
$\varepsilon$-approximation of size given by the VC-dimension bound can be found efficiently
by random sampling [25].

One upper bound for the VC-dimension of a frequent set collection can be
obtained from the number of closed frequent sets:

Theorem 2. The VC-dimension of the collection $cover(\mathcal{F}(\sigma, d), d)$ is at most
$\log |\mathcal{C}(\sigma, d)|$ where $\mathcal{C}(\sigma, d)$ is the collection of closed frequent sets in $\mathcal{F}(\sigma, d)$.
This bound is tight in the worst case.

Proof. The VC-dimension of $cover(\mathcal{F}(\sigma, d), d)$ is at most
$\log |cover(\mathcal{F}(\sigma, d), d)|$. Clearly, $|cover(\mathcal{C}(\sigma, d), d)| \le |\mathcal{C}(\sigma, d)|$. Thus it suffices
to show that $cover(X, d) = cover(cl(X, d), d)$. However, this is immediate since,
by definition, $cl(X, d)$ is contained in each transaction of $d$ which contains $X$.

The VC-dimension $\log |cover(\mathcal{C}(\sigma, d), d)|$ is achieved by the collection of
0-frequent sets for a data set determined by $n$ attributes $1, 2, \ldots, n$ with
$cover(i, d) = [n] \setminus \{i\}$ for each $i \in [n]$. Then $\mathcal{F}(\sigma, d)$ (and $\mathcal{C}(\sigma, d)$, too)
consists of all subsets of $[n]$. □
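The tightness example in the proof can be verified by brute force for small n. The sketch below computes the VC-dimension of a collection of covers by checking which index sets are shattered; it is exponential in n and only meant as an illustration, with n = 3 and all names chosen here as assumptions. Indices run from 0 rather than 1.

```python
from itertools import combinations
from math import log2

def vc_dimension(n, covers):
    """Largest |T|, T a subset of range(n), such that the covers shatter T."""
    covers = {frozenset(c) for c in covers}
    best = 0
    for k in range(1, n + 1):
        shattered = any(
            len({frozenset(T) & c for c in covers}) == 2 ** k
            for T in combinations(range(n), k)
        )
        if not shattered:
            break                        # no larger set can be shattered either
        best = k
    return best

# Data set from the proof: transaction j contains every attribute except j.
n = 3
d = [frozenset(range(n)) - {j} for j in range(n)]
attribute_sets = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]
covers = [frozenset(j for j, t in enumerate(d) if X <= t) for X in attribute_sets]
print(vc_dimension(n, covers), log2(len(set(covers))))
```

Every subset of the attributes has a distinct cover here, so the covers form all subsets of the transaction indices and the VC-dimension equals the base-2 logarithm of their number.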

The cardinality of $\mathcal{C}(\sigma, d)$ can be estimated accurately by checking, for a random
subset of the frequent sets, how many of them are closed in $d$. Theorem 2 together
with Lemma 1 implies the following upper bound:

Corollary 1. For the set collection $\mathcal{F}(\sigma, d)$, there is an $\varepsilon$-approximation of
$O(\log |\mathcal{C}(\sigma, d)| \log(1/\varepsilon)) / \varepsilon^2$ transactions.

These bounds are quite general upper bounds neglecting many fine details of
the frequent set collections, and thus smaller samples usually give approximations
of the required quality. It is probable that the bounds could be improved by
taking into account the actual frequencies of the frequent sets. A straightforward
way to improve the upper bounds is to bound the VC-dimension more tightly.
In practice, the sampling can be stopped right after the largest absolute error
between the frequency estimates and the correct frequencies is at most $\varepsilon$.

4 Optimizing Sample Weights 

Given a random subset $d'$ of the transactions in $d$, the frequencies in $d$ for the sets in
a given set collection $\mathcal{F}$ can be estimated from $d'$. If the correct frequencies
are also known, the error of the estimates can be measured.

Furthermore, the transactions in the sample can be weighted (in principle) in
such a way that the frequency estimates are as good as possible. Let us denote the
sample from $d = \{d_1, \ldots, d_n\}$ by $d' = \{d'_1, \ldots, d'_{n'}\}$ and let $w_1, \ldots, w_{n'}$ denote
the weights of $d'_1, \ldots, d'_{n'}$. The frequency of $X$ in the weighted sample $d'$ is the
sum of the weights $w_i$ of the transactions $d'_i$ containing $X$. Note that due to the
weights we can assume that $d'$ is a set although $d$ is a multiset. Furthermore, we
assume that all weights are nonnegative, i.e., no transaction in $d'$ can be an
anti-transaction.

482 T. Mielikainen

The search for optimal weights w.r.t. a given loss function $\ell$ can be formulated
as an optimization task as follows:

minimize $\ell(\mathcal{F}, d, d', w)$
subject to $w_i \ge 0$, $i = 1, \ldots, n'$
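To make the objective concrete, the sketch below evaluates one possible loss, the maximum absolute error, for given weights. It does not solve the optimization task; the toy frequencies, the two-transaction sample, and the second weight vector are hypothetical and hand-picked for illustration.

```python
def weighted_fr(X, sample, w):
    """Frequency of X in the weighted sample: sum of weights of transactions containing X."""
    return sum(wi for t, wi in zip(sample, w) if X <= t)

def max_abs_error(true_fr, sample, w):
    """Maximum absolute difference between true and weighted-sample frequencies."""
    return max(abs(f - weighted_fr(X, sample, w)) for X, f in true_fr.items())

# Hypothetical true frequencies in d for a small collection of sets.
true_fr = {
    frozenset("a"): 0.75, frozenset("b"): 0.75, frozenset("c"): 0.5,
    frozenset("ab"): 0.5, frozenset("ac"): 0.5,
}
sample = [frozenset("ab"), frozenset("ac")]      # a sample of two transactions

uniform = [0.5, 0.5]                             # plain (unweighted) sample estimate
adjusted = [7 / 12, 1 / 3]                       # hand-picked nonnegative weights
print(max_abs_error(true_fr, sample, uniform))
print(max_abs_error(true_fr, sample, adjusted))
```

Uniform weights 1/n' reproduce the ordinary sample estimate; in this toy case the adjusted weights lower the maximum error from 0.25 to about 0.167.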



The weight $w_i$ can be interpreted as a measure of how representative the transaction
$d'_i$ is in the sample $d'$, and thus the weights can also be used to score
the transactions: On the one hand, the transactions with significantly larger weights
than the average weight can be considered very good representatives of the
data. On the other hand, the distribution of the weights tells about the skewness
of the sample (and possibly also the skewness of the set collection $\mathcal{F}$) w.r.t. $d$. If
the loss function $\ell$ is convex then the optimization task can be solved optimally
in (weakly) polynomial time [26]. The most well-known examples of convex loss
functions are $L_p$ distances

$\left( \sum_{X \in \mathcal{F}} \left| fr(X, d) - \sum_{X \subseteq d'_i \in d'} w_i \right|^p \right)^{1/p}$.

As a concrete example, let us consider the problem of minimizing the maximum
absolute error of the frequency estimates. This variation of the problem
can be formulated as a linear optimization task which can be solved even faster
than the general (convex) optimization task:

minimize $\varepsilon$
subject to $\varepsilon \ge fr(X, d) - \sum_{X \subseteq d'_i, i \in [n']} w_i$ for all $X \in \mathcal{F}(\sigma, d)$
           $\varepsilon \ge \sum_{X \subseteq d'_i, i \in [n']} w_i - fr(X, d)$ for all $X \in \mathcal{F}(\sigma, d)$
           $w_i \ge 0$, $i = 1, \ldots, n'$

The instance consists of $n' + 1$ variables and $2 |\mathcal{F}(\sigma, d)| + n'$ inequalities. The
number of inequalities in the instance can be further reduced to $2 |\mathcal{C}(\sigma, d)| + n'$
by recognizing that if $cl(X, d) = cl(Y, d)$ then the corresponding inequalities
are equal, i.e., it is enough to have the inequalities for the collection $\mathcal{C}(\sigma, d)$ of
closed frequent sets. This can lead to exponential speed-ups.

To evaluate the usability of weighting random samples of transactions, we
experimented with the random sampling of transactions (without replacement)
and the linear programming refinement using the IPUMS Census data set from
the UCI KDD Repository.* The IPUMS Census data set consists of 88443 transactions
and 39954 attributes. The randomized experiments were repeated 80 times.

The results are shown in Figure 1. The labels of the curves correspond to the
minimum frequency thresholds and the curves with (lp) are the corresponding
linear programming refinements. The results show that already a very small
number of transactions, when properly weighted, suffices to give good approximations
for the frequencies of the frequent sets. Similar results were obtained in
our preliminary experiments with other data sets.

* http://kdd.ics.uci.edu

Fig. 1. IPUMS Census data (horizontal axis: the number of transactions)



Note that although it would be desirable to choose the transactions that
determine the frequencies best instead of a random sample, this might not be easy
since finding the best transactions with optimal weights resembles the cardinality
constrained knapsack problem that is known to be difficult [27].



5 Conclusions 

In this paper we studied how to separate the descriptions of patterns and their 
interestingness values. In particular, we described how frequent sets can be de- 
scribed in a small space by the maximal frequent sets, the minimal infrequent 
sets, or their combination. Also, we studied how samples can be used to describe 
frequencies of the frequent sets: we gave upper bounds for the sample com- 
plexity using closed frequent sets and described a practical convex optimization 
approach for weighting the transactions in a given sample. 

Representing pattern collections and their interestingness values separately 
seems to offer some benefits in terms of understandability, size and efficiency. 
There are several interesting open problems related to this separation of the 
structure and the interestingness: 

- What kind of patterns and their interestingness values can be efficiently and
  effectively described by a sample from the data?
- How can a small data set that represents the frequencies of frequent sets be
  found efficiently?
- What kind of weightings are of interest to determine the costs for the maximal
  frequent sets and the minimal infrequent sets?
- How does the representation of the patterns guide the knowledge discovery
  process?
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Abstract. In many cases, normal uses of a system form patterns that 
will repeat. The most common patterns can be collected into a predic- 
tion model which will essentially predict that usage patterns common in 
the past will occur again in the future. Systems can then use the predic- 
tion models to provide advance notice to their implementations about 
how they are likely to be used in the near future. This technique cre- 
ates opportunities to enhance system implementation performance since 
implementations can be better prepared to handle upcoming usage. 

The key component of our system is the ability to intelligently learn
about system trends by tracking file system and memory system activity
patterns. The usage data that is tracked can be subsequently queried
and visualized. More importantly, this data can also be mined for intelligent
qualitative and quantitative system enhancements including predictive
file prefetching, selective file compression and application-driven
adaptive memory allocation. We conduct an in-depth performance evaluation
to demonstrate the potential benefits of the proposed system.



1 Introduction 

System interfaces cleanly separate user and implementor. The user is concerned 
with invoking the system implementation with legal parameters and in a correct 
state (e.g. only writing to a valid file handle). The implementor is concerned 
with providing an efficient and correct solution to a systems-level problem (e.g. 
completing a file write operation as quickly as possible, returning an error if the 
operation has failed). Yet these two opposite ends of system interfaces can be 
designed, built, and deployed without considering the usefulness of information 
that may not be available until runtime. Systems software components can be 
optimized by adapting their implementations to how they are being used. 

For example, if a file system knows in advance that a particular user will
soon open a specific set of files in a specific order, the file system can anticipate
those operations and pull data that is about to be opened into the filesystem cache
to speed up future operations. Similarly, a dynamic memory manager that is expecting
a certain type of allocation request from a particular program can be optimized
to better serve the needs of that program. Although there is a
wealth of data about system usage available at runtime, only a portion of the
available information is useful. It remains a challenge to collect, analyze and
mine the most useful bits of knowledge from the sea of available data. In this
paper, we discuss tools and methods for the collection, analysis and mining of
system usage data. The key contributions of our work are:

1. mining user-specific file system traces and enabling personalized prefetching
   of the user's file system space

2. demonstrating the viability of automatically generating application-specific
   suballocators and providing tools to automate the process

The rest of this paper is organized as follows. Section 2 discusses related 
work. Sections 3 and 4 detail the architecture and experimental results for a 
specialized file system. Sections 5 and 6 detail the architecture and experimental 
results for a specialized memory allocator. Section 7 discusses conclusions and 
Sect. 8 discusses future work. 

2 Related Work 

Kroeger and Long [1] evaluated several approaches to file access prediction. The 
model they promoted used frequency information about file accesses to develop 
a probabilistic model which predicted future accesses. However, the downside of 
storing frequency information about successors is that the memory requirements 
are high, with potentially little gain if rarely-used successors are stored; in our 
N-gram approaches, we aimed to reduce the state information over these fre- 
quency approaches. The N-gram prediction model is commonly used in speech- 
processing research [2] , and associates a sequence of events of length N with the 
event that follows. Su et al. [3] extended the N-gram with an N-gram+ model, a
collection of N-grams that work together to make predictions. Mowry et al. built
a system [4,5] which automatically inserts prefetch and release instructions into 
programs that work with large (out-of-core) data sets, to reduce virtual memory 
I/O latency. In [6], the authors develop Markov models based on past WWW 
accesses to predict future accesses; however, they base predictions on the history 
of many users, whereas in our work we personalize our predictions per user. 

In [7], Grunwald and Zorn develop CustoMalloc, a tool that develops a cus- 
tomized allocator to replace malloc in a C program, given the program’s source 
code. CustoMalloc inspired the work we have done with dynamic memory man- 
agement. The primary difference between CustoMalloc and our system is that 
CustoMalloc requires access to program source-code, while ours does not. In [8], 
Parthasarathy et al. developed a customized memory allocator for parallel data 
mining algorithms that reduced program execution time. By observing how data 
structures were being accessed, they took care to place data structure nodes in 
a way that reduced false sharing yet increased spatial locality.

3 Prefetching Filesystem Architecture

This section describes an architecture that increases filesystem performance 
through prefetching. Figure 1 displays an architectural overview; each major 
component of this architecture will now be discussed. 




Fig. 1. A specialized prefetching filesystem 



3.1 Filesystem Access Traces 

In order to apply filesystem access trace data in useful ways, we needed
a source of filesystem traces, and we decided to generate our own. Our test
machine was a Solaris 2.8 system with four 296-MHz CPUs and three gigabytes
of RAM. We designed a program that would start up when a login shell was
invoked by a user and shut down when that user logged out. This way, each
set of filesystem access traces was generated on a per-user basis, resulting in no 
interference from other users or system processes. Our users consisted of students 
who simply performed their normal work on the system for several weeks while 
the tracing program was enabled. For example, they checked email, edited files, 
browsed web sites, compiled programs, etc. The attributes we stored for each
file access included the name of the file accessed, the date and time of the access, 
the access mode (read, write, or read+write), the application that accessed the 
file and the process ID that accessed the file. 



3.2 Pattern Analysis and Mining 

We present several models for usage pattern analysis, including the 1-gram,
2-gram+, p-s-gram+ and association rule mining models. After discussing these
prediction models, we describe one application of them: a prefetching cache.



p-s-gram+ models. As mentioned in Sect. 2, an N-gram associates a sequence
of events of length N with the event that follows. The 1-gram approach is an
instance of the N-gram model where N = 1, and all events are file accesses. It
has also been described as a last-successor model [1], because it records the last
successor of each file access. When used in prediction, the 1-gram simply predicts
that the file that most recently succeeded a given file will succeed that file
again the next time it is accessed. The 2-gram+ approach is an N-gram+ where the
maximum N is 2. It subsumes both a 2-gram and a 1-gram and maintains tables
for both, giving priority to the 2-gram table for predictions.

The p-s-gram+ model is a generalization of the N-gram+ model. Rather 
than predicting only one successor file, it can be configured to predict any num- 
ber of successors (s). The number of predecessors (p) is configurable as usual. 
The system supports fallback for p, so that if the predictor with the largest
predecessor count cannot make a prediction, the predictor with the second-largest
predecessor count is consulted, and so on. For example, if the file accesses
[A B C A Z Y] were made previously, a 2-3-gram+ would build the prediction 
tables at the left in Fig. 2. 

When used in prediction, the p-s-gram+ performs exactly like the N-gram+, 
except that multi-file predictions can be made. For example, if the above tables 
had been built and the input BC was given to the p-s-gram+, the output would 
be AZY, meaning that files A, Z and Y are all predicted to be accessed soon. 
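
The table construction and fallback lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name PSGramPlus and its methods are hypothetical, and it assumes that a key is recorded only when a full window of s successors is available, which agrees with the tables of Fig. 2 but is not stated explicitly in the text.

```python
class PSGramPlus:
    """Hypothetical sketch of a p-s-gram+ predictor (not the authors' code)."""

    def __init__(self, p, s):
        self.p = p  # maximum number of predecessors used as a key
        self.s = s  # number of successors stored per key
        # One prediction table per predecessor count k = 1..p.
        self.tables = {k: {} for k in range(1, p + 1)}

    def train(self, accesses):
        seq = list(accesses)
        for k in range(1, self.p + 1):
            # Assumption: only windows with a full set of s successors are stored.
            for i in range(len(seq) - k - self.s + 1):
                key = tuple(seq[i:i + k])
                # A newer successor list simply overwrites the older one.
                self.tables[k][key] = tuple(seq[i + k:i + k + self.s])

    def predict(self, recent):
        recent = list(recent)
        # Consult the largest predecessor count first, then fall back.
        for k in range(self.p, 0, -1):
            if len(recent) >= k:
                successors = self.tables[k].get(tuple(recent[-k:]))
                if successors is not None:
                    return successors
        return ()


model = PSGramPlus(p=2, s=3)
model.train("ABCAZY")
print(model.predict("BC"))  # answered by the 2-predecessor table
print(model.predict("XC"))  # falls back to the 1-predecessor table
```

Because each key holds a single successor list, training needs no expiration policy: an outdated entry disappears as soon as a newer one for the same key is seen.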

Although the p-s-gram+ and similar models are somewhat limited in the 
sense that they only keep a single collection of successor values compared to an 
approach that keeps a probability distribution of successor values, one advan- 
tage is that the implementation for prediction table maintenance is simplified. 
Also, outdated successors are removed from the tables automatically when newer 
successors come along, thus avoiding an expiration policy for rare successors. 

Cache simulations revealed that the 1-predecessor and 5-successor combination
performed as well as any of the p-s-gram+ models tested. Therefore, the
1-5-gram+ was chosen to represent the p-s-gram+ model in this paper.



Example p-s-gram+ table

2-3-gram                1-3-gram
key    value            key    value
AB     C A Z            A      B C A
BC     A Z Y            B      C A Z
                        C      A Z Y

Example ARM table

Antecedent    Consequent
A             B,C,Z
BC            A,Z,Y
ABC           Z
BCAZ          Y

Fig. 2. Prediction tables



Association Rule Mining. The Association Rule Mining approach (ARM)
builds association rules [9] out of the file accesses that have been seen. The 
Apriori [10] algorithm is applied to the training data to yield a set of association 
rules that can be used during testing. This approach does not consider the order 
of file accesses. Rather, it produces rules pertaining to all files accessed within 
the same window size (configured to 5 accesses). For example, if the file accesses 
[A B C A Z Y] were made during training, ARM might build the rightmost 
table of Fig. 2 (for brevity, only a few rules are shown).

By definition, the ARM approach subsumes the rules generated by the
p-s-gram+-based models. In addition, ARM will generate rules about files associated
a few accesses apart, in any order. Each rule also has a value for support (how
frequently all files mentioned in the rule appear together, compared to all files that
appear together) and confidence (what percentage of the time the consequent of
the rule appears, if the antecedent of the rule appears).
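
To make the two measures concrete, the following Python sketch computes support and confidence over windows of file accesses. It is a hedged illustration only: the function names are hypothetical, the use of overlapping sliding windows is an assumption, and the Apriori rule generation itself is not implemented here.

```python
def windows(accesses, size=5):
    """Split an access sequence into order-free sets of `size` consecutive accesses.

    Assumption: overlapping sliding windows; the text only gives the window size.
    """
    if len(accesses) <= size:
        return [set(accesses)]
    return [set(accesses[i:i + size]) for i in range(len(accesses) - size + 1)]


def support(itemset, transactions):
    """Fraction of windows that contain every file in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in windows that contain the antecedent."""
    base = support(antecedent, transactions)
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / base if base else 0.0


transactions = windows(list("ABCAZY"), size=5)
print(support({"B", "C"}, transactions))            # both windows contain B and C
print(confidence({"B", "C"}, {"Y"}, transactions))  # Y appears in one of them
```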

When used in prediction, ARM uses some portion of recent file accesses to 
generate a list of possible predictions. This list is sorted by support, then by 
confidence, and finally by length of the antecedent of the rule. During testing,
this sort order yielded better results on average than any other possible sort
order. Finally, a configurable number of predictions with highest
support are made.
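
The ranking step can be sketched as follows. The rule representation (a dictionary with antecedent, consequent, support and confidence fields) and the function name are hypothetical, and the sketch assumes that larger values rank first for all three sort keys, including antecedent length, which the text does not specify.

```python
def rank_predictions(rules, recent, max_predictions=9):
    """Return up to max_predictions files suggested by rules matching recent accesses."""
    recent = set(recent)
    # A rule applies when all files of its antecedent were accessed recently.
    applicable = [r for r in rules if set(r["antecedent"]) <= recent]
    # Sort by support, then confidence, then antecedent length (assumed descending).
    applicable.sort(
        key=lambda r: (r["support"], r["confidence"], len(r["antecedent"])),
        reverse=True,
    )
    predictions = []
    for rule in applicable:
        for name in rule["consequent"]:
            if name not in predictions:
                predictions.append(name)
    return predictions[:max_predictions]


rules = [
    {"antecedent": ("A",), "consequent": ("B",), "support": 0.5, "confidence": 0.9},
    {"antecedent": ("A",), "consequent": ("C",), "support": 0.8, "confidence": 0.6},
    {"antecedent": ("X",), "consequent": ("D",), "support": 0.9, "confidence": 0.9},
]
print(rank_predictions(rules, recent=["A"]))
```

Capping the list matters: as noted in the conclusions, an unrestricted ARM predictor would make so many predictions that it would pollute the cache.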



Prefetching Cache. Using these prediction models, a prefetching cache can be 
built. After each file access is made, a cache system could use a prediction model 
to determine what files are expected to be accessed soon. If the prediction model 
returned any predictions and there was spare I/O bandwidth, the cache could 
begin to fill with data from the predicted files, in the hopes that future regular 
accesses for the predicted files would result in cache hits. Finally, the system 
could make updates to reflect recent accesses. Instead of just prefetching a file, 
the system could also uncompress a compressed file. A simulation of prefetching 
caches is presented in Sect. 4. 

4 Results from a Prefetching Cache Simulation 

The effectiveness of the prediction models was tested via a filesystem cache
simulation. The data set used in this paper is referred to as vip-fstrace and
contains a single graduate student’s file access traces over a 2-week period,
totaling around 20,000 file accesses; we have performed the same experiments
using data sources from other users as well, with comparable results. All of the
file accesses in vip-fstrace were captured during normal day-to-day use of a
filesystem. In the experiments that follow, cachesize defines the number of
files the cache can hold. In this paper, we only consider whole-file caching.
Training amount defines what percentage of the file accesses were used as
training data to build the prediction model. During training, no statistics on
cache hits were kept.

A number of caching strategies were simulated, including the non-predicting
model, the 1-gram model, the 2-gram+ model, the 1-5-gram+ model, and the
association rule mining model. In addition, we built a hybrid model out of the
top-performing ARM and 1-5-gram+ models. The hybrid model simply consults each
of its sub-models for predictions, and makes any predictions that either
sub-model offers.

Model             Hit Rate
non-predicting    63%
1-gram            84%
2-gram+           87%
1-5-gram+         90%
ARM               90%
hybrid            91%

Fig. 3. Hit rate comparison

The non-predicting approach is a baseline cache that has no knowledge of the
file access histories. This cache simply adds recently accessed files to the MRU
end and removes items from the LRU end to make room for new entries in cache.
When a file in cache is accessed, it becomes the MRU object. The non-predicting 
approach can still yield a high hit rate, given sufficient re-use of recently accessed 
files. The other caches were built by augmenting the baseline cache with one of 
the forms of prediction discussed in this paper. Figure 3 shows a direct compar- 
ison of the caching strategies. 
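
A whole-file cache simulation of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulator used for the experiments: the function name simulate and the predict callback are hypothetical, prefetched files are assumed to enter at the MRU end, and prefetches are assumed to complete instantly.

```python
from collections import OrderedDict


def simulate(accesses, cachesize, predict=None):
    """Return the hit rate of an LRU whole-file cache, optionally with prefetching."""
    cache = OrderedDict()  # first item is the LRU entry, last item is the MRU entry
    hits = 0

    def touch(name):
        if name in cache:
            cache.move_to_end(name)        # becomes the MRU object
        else:
            cache[name] = True
            if len(cache) > cachesize:
                cache.popitem(last=False)  # evict from the LRU end

    for name in accesses:
        if name in cache:
            hits += 1
        touch(name)
        if predict is not None:
            # Assumption: every predicted file is prefetched immediately.
            for guess in predict(name):
                touch(guess)
    return hits / len(accesses)


successor = {"A": ["B"], "B": ["C"], "C": ["A"]}  # toy last-successor table
print(simulate("ABCABC", cachesize=2))
print(simulate("ABCABC", cachesize=2, predict=lambda f: successor.get(f, [])))
```

In this toy trace the cyclic pattern defeats plain LRU with two slots, while the predicting cache misses only the very first access.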

The hybrid approach offers an advantage in that predictions missed by one 
model are caught by the other, which explains the slight increase in hit rate for 
the hybrid model over its components. 

A closer look at the 1-5-gram+, ARM and hybrid hit rates reveals how close
to optimal the predictors can be. These caches missed around 10% of the time.
However, further investigation showed that 75% of these misses were due to
accesses during testing for files that did not exist during training. For these files,
it would be impossible to generate any predictions, since no patterns
involving these files have been seen. So, the optimal hit rate with this data set
would be about 92.5%.
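
The 92.5% figure follows directly from the two percentages quoted above; as a short worked calculation:

```latex
\text{unavoidable misses} \approx 0.10 \times 0.75 = 0.075,
\qquad
\text{optimal hit rate} \approx 1 - 0.075 = 0.925
```

Equivalently, only the remaining 25% of the misses, about 2.5% of all accesses, could have been turned into hits by a better predictor.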

Figure 3 used the vip-fstrace data, with cachesize set to 70 files and training 
amount set to 55%. For the ARM system, max predictions was set to 9 (meaning
that up to 9 files will be prefetched into filesystem cache), support threshold was
set to 0.005 and confidence threshold was set to 0.5. Unless otherwise specified, 
the remaining experiments all use this configuration. 





Fig. 4. Hit rates as training data varies (left; x-axis: training amount), and the
effect of varying I/O cost (right; x-axis: I/O cost in seconds per access)



4.1 Varying Training Data Amount 

The left figure of Fig. 4 shows the hit rate performance of the various approaches 
to caching as training size varies. Not surprisingly, all of the approaches that 
actually trained performed better with a larger training data amount. The results 
were obtained by averaging together 5 unique random samples, each using 50%
of the available records for both training and testing purposes. For all training
amounts larger than 50%, all predictive approaches yielded hit rates about 1.5 
times greater than the non-predicting approach. 



4.2 Prefetch Arrival Time 

When a prefetch is made, time is spent pulling the file into cache to ensure 
it is available in cache for a later fetch. It is crucial that most of the time, 
prefetched files arrive in cache before the later fetch is made. If the prefetch is 
still in progress when the predicted fetch occurs, then the prefetch was not only 
a waste of time, but also may have caused unrelated I/O to block. On the test 
system, the cost of each file access was determined to be between 0.001 and 0.002 
seconds. However, the cost of performing I/O is system-dependent, and varies 
with each instance, depending on available I/O bandwidth at access time. 

The right figure of Fig. 4 shows the cache hit rate as the cost of I/O varies. 
As the cost of I/O increases, performance degrades for all of the prefetching 
approaches. With sufficiently slow I/O, all of the prefetching systems degrade 
to worse than the non-prefetching approach. For all I/O costs tested, the ARM-
based approaches (ARM, hybrid) performed, on average, 1.4 times better than
the p-s-gram+-based approaches.



5 Application Driven Adaptive Memory Allocation 

A general purpose dynamic memory manager has the burden of providing good 
performance to a variety of applications. In order to achieve this goal, some 
sacrifices must be made. For example, a general purpose memory manager has 
to develop a general strategy for exactly how and when to reserve additional 
heap memory for a process via an expensive system call such as sbrk. A general 
purpose memory manager must also make a decision as to its arrangement of 
free lists. In order to avoid having to reserve additional heap memory to satisfy 
every malloc request, most memory managers will reserve a large block of heap 
memory at once and incorporate unused portions of the heap memory block into 
free lists of various sizes. Then, when a malloc request of a given size arises,
it may be possible to satisfy the request by retrieving an item from one of the 
free lists, thus minimizing the number of times that more heap memory must be 
acquired. The optimal arrangement of these free lists varies per application. 

Sometimes programmers develop application-specific memory managers or 
suballocators to try to leverage their own knowledge of their programs. While 
malloc is designed to handle many request sizes, a suballocator could be writ- 
ten to handle the most frequently allocated size in an optimal manner. However, 
developing these per-application suballocators is time consuming in terms of 
programmer time and there is no guarantee that the suballocator will improve 
performance. Ideally, the problem of when and how to optimize dynamic allocation
on a per-application basis could be solved generically.

To generically solve the problem of per-application allocation, we developed
a system which automatically profiles an application’s allocations during initial 
runs of the program, helps to build a dedicated suballocator for the program, and 
utilizes the suballocator to enhance performance for future runs of the program. 
All of this can be done without recompiling the application program. 

In order to profile the allocation behavior of a program, special tracing 
versions of malloc, realloc, calloc and free were developed. Using the
LD_PRELOAD environment variable shared-library hook technique, these special 
tracing versions replaced the standard C library versions during a program’s ex- 
ecution. They recorded each allocation and free that occurred during a program’s 
execution to a history file, along with the allocation size. 

We developed an analysis tool to analyze the allocation history file. The 
analysis tool identified how many allocations were of each size, which sizes
were allocated most frequently, and how often a request for a block of given size
was made after the release of a block of the same size (referred to in this paper as 
a reuse opportunity). The tool also reported the maximum number of blocks for 
each size that were simultaneously freed yet later could be reused; this guided 
the length of the free lists for each size. 
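
Part of this analysis can be sketched in Python. The trace format used here, tuples of ("malloc", address, size) and ("free", address), is a hypothetical stand-in for the history file, whose actual layout is not described, and the sketch covers only the per-size counts and the reuse opportunities.

```python
from collections import Counter


def analyze(events):
    """Count allocations per size and reuse opportunities per size."""
    live = {}                 # address -> size of each currently allocated block
    freed = Counter()         # size -> released blocks not yet matched to a request
    alloc_counts = Counter()  # size -> number of allocation requests
    reuse = Counter()         # size -> number of reuse opportunities
    for event in events:
        if event[0] == "malloc":
            _, addr, size = event
            alloc_counts[size] += 1
            if freed[size] > 0:
                # A block of the same size was released earlier.
                freed[size] -= 1
                reuse[size] += 1
            live[addr] = size
        elif event[0] == "free":
            _, addr = event
            size = live.pop(addr, None)
            if size is not None:
                freed[size] += 1
    return alloc_counts, reuse


trace = [
    ("malloc", 1, 32), ("malloc", 2, 32), ("free", 1), ("malloc", 3, 32),
    ("malloc", 4, 64), ("free", 4), ("free", 2), ("malloc", 5, 32),
]
counts, reuse = analyze(trace)
print(counts.most_common(1), dict(reuse))
```

Sizes with many reuse opportunities are the natural candidates for dedicated free lists.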

The analysis tool also produced a graph of the total memory in use during 
each allocation or deallocation event. By viewing this graph, one can quickly see
whether or not an application could take advantage of enhanced reuse function- 
ality. For example, the left graph of Fig. 5 shows an example of an application 
where memory is acquired, released and then reacquired. Each reacquisition of 
a block of memory represents a reuse opportunity. On the other hand, the right 
graph of Fig. 5 shows an example of an application with a monotonically in- 
creasing memory footprint, which means there are no reuse opportunities. 





Fig. 5. Different applications have different reuse opportunities 



To build a dedicated suballocator for a program, we specified the following 
properties: the initial size of the heap, the amount the heap should grow when 
needed, and the size class and length of between zero and four free lists. It 
has been observed [7] that over 95% of allocation requests could typically be
satisfied by using only four free lists, so we only used up to four free lists. For
the applications we evaluated, we found this to be true as well. The construction 
of the suballocator was complete when the above properties were written to a 
properties file corresponding to that application. 

Finally, to utilize the suballocator, special versions of malloc, realloc,
calloc and free were developed. Using the same LD_PRELOAD shared-library
hook technique that enabled the tracing, these special versions replaced the
standard C library versions during the program’s execution. Before the program
was executed, this special version of malloc was initialized. During initialization,
the file containing the properties that specified the details of the suballocator 
was read into memory. The default heap was extended to be as big as the initial 
size of the heap memory as specified in the file. 

If there were any free lists specified in the properties file, then they were
initialized at this time. All of the nodes constituting each free list were stored
contiguously, starting from the beginning of the heap. If an allocation request
could not be satisfied using one of the free lists, a general purpose first-fit allocation
strategy was used.
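
The layout and lookup order just described can be modelled in Python. This is a simplified model over heap offsets rather than real memory, intended only to illustrate the idea: the class name, the constructor arguments standing in for the properties file, and the absence of heap growth and hole coalescing are all assumptions of the sketch.

```python
class Suballocator:
    """Toy model: fixed-size free lists at the start of the heap, first-fit after them."""

    def __init__(self, heap_size, free_lists):
        # free_lists maps a size class to the number of nodes reserved for it.
        self.free_lists = {}
        offset = 0
        for size, length in free_lists.items():
            # Nodes of each free list are laid out contiguously from the heap start.
            self.free_lists[size] = [offset + i * size for i in range(length)]
            offset += size * length
        self.holes = [(offset, heap_size - offset)]  # general region: (start, length)
        self.blocks = {}  # offset -> (size, taken_from_free_list)

    def malloc(self, size):
        nodes = self.free_lists.get(size)
        if nodes:
            offset = nodes.pop()
            self.blocks[offset] = (size, True)
            return offset
        # Fall back to first fit in the general-purpose region.
        for i, (start, length) in enumerate(self.holes):
            if length >= size:
                if length == size:
                    del self.holes[i]
                else:
                    self.holes[i] = (start + size, length - size)
                self.blocks[start] = (size, False)
                return start
        raise MemoryError("heap exhausted; a real allocator would grow the heap here")

    def free(self, offset):
        size, from_list = self.blocks.pop(offset)
        if from_list:
            self.free_lists[size].append(offset)
        else:
            self.holes.append((offset, size))  # no coalescing in this sketch


heap = Suballocator(heap_size=1024, free_lists={16: 2, 64: 1})
first = heap.malloc(16)    # served by the 16-byte free list
second = heap.malloc(16)   # served by the 16-byte free list
third = heap.malloc(16)    # free list empty, served by first fit
print(first, second, third)
```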

6 Results from a Specialized Memory Manager 

This section reveals some notable performance improvements in program exe- 
cution time discovered by trying out suballocators on a variety of applications. 
In each case, at least five runs were executed with and without the suballoca- 
tor, and the timings for each were averaged. For all of the tests which accessed 
files, several initial “priming” runs were executed and the results thrown away, 
to eliminate the effect of file system cache. The test system was a GNU/Linux
system with the GNU C library version 2.2.5 installed.

There were actually many applications which showed no benefit during test- 
ing; in fact, in some cases, the suballocators slowed down the performance. This 
reflects the reality that the default dynamic allocator does an excellent job in 
many cases, and is difficult to beat on average. Since the suballocator can be 
selectively turned on or off per-application, it can be leveraged where it is needed 
most, and turned off otherwise. 

Figure 6 lists some applications whose performance was enhanced via a
suballocator. All of these performance improvements were realized without
requiring source-code access to any of the programs. These performance improvements
were discovered after modest amounts of analysis and re-engineering of
the suballocators to find the best possible performance.

application                            default       suballocator   improvement
convert: convert a 0.5 MB image        20.70 sec     20.13 sec      2.83%
  from JPEG to GIF format
tar: unpacking the linux kernel        123.77 sec    114.75 sec     7.87%
  sources
gawk: update 100,000 key value         11.83 sec     10.95 sec      8.04%
  pairs via gawk’s associative arrays
zip: use Info-ZIP to compress 6MB      5.04 sec      4.59 sec       9.68%
  of text files

Fig. 6. Performance improvements using suballocators

7 Conclusions

We have several specific conclusions to make related to a prefetching filesystem
cache. The 1-5-gram+ model and the ARM model are both excellent, near-optimal
predictive models for file accesses. A hybrid combination of these appears
to do even slightly better, as shown in Fig. 3, as predictions missed by one
model are caught by the other. The lower space requirements and simpler, more
efficient pattern analysis make the 1-5-gram+ model attractive. The ARM model
performs best when it makes a limited number of predictions; unrestricted, ARM
would have so many predictions to make that it would end up polluting the cache.

We have explored the idea of using memory usage patterns to enhance dy- 
namic allocation performance. We discovered applications whose overall program 
execution time improved by nearly 10% by utilizing suballocators, without re- 
quiring access to the applications’ source code. These per-application improve- 
ments indicate that for some applications, there are repetitious memory alloca- 
tion patterns that are manifested across multiple application runs. When such 
patterns exist, they can clearly be leveraged to improve performance. 

8 Future Work 

Future work in the area of file systems includes the full-blown implementation 
of a prefetching file system cache with automatic decompression. The implemen- 
tation work will involve replacing the existing file system’s cache infrastructure 
with our own, and measuring changes in performance. The implementation
should consider the size of prefetched files, to avoid, in some cases, prefetching
files that are so large that they will poison the cache.

In the area of dynamic memory management, we would like to integrate into a
state-of-the-art dynamic memory manager the ability to generate per-application
suballocators. This would avoid the need to use a dynamic library feature of
GNU/Linux to enable the suballocators, a feature that is not necessarily available
on all platforms. In addition, we would like to fully automate the process of 
suballocator generation.

Abstract. This paper proposes an automatic text extraction method using
neural networks. Automatic text extraction is a crucial stage for
multimedia data mining. We present an artificial neural network (ANN)-based
approach for text extraction in complex images, which uses a combined
method of ANNs and non-negative matrix factorization (NMF)-based filtering.
An automatically constructed ANN-based classifier can increase recall rates
for complex images with a small amount of user intervention. NMF-based
filtering enhances the precision rate without affecting overall performance. As
a result, a combination of the two learning mechanisms leads to not only robust but
also efficient text extraction.



1 Introduction 

Content-based image indexing is the process of attaching content-based labels to 
images and video frames. Because texts within an image are very useful
for describing the contents of the image and can be easily extracted compared with
other semantic contents, researchers have attempted text-based image indexing using
various image processing techniques [1-7, 14, 15]. In text-based image indexing,
automatic text extraction is very important as a prerequisite stage for optical character
recognition, and it is still considered a very difficult problem because of text
variations in size, style, and orientation as well as the complex backgrounds of images.
There are two primary methods for text extraction: connected component (CC) 
methods (CCMs) and texture-based methods. The CCMs [2,4] are very popular for 
text extraction thanks to their simplicity in implementation, but are not appropriate for 
low-resolution and noisy video documents because they depend on the effectiveness 
of the segmentation method and it is not easy to filter non-text components using only 
geometrical heuristics. Unlike the CCMs, the texture-based methods [1,3, 5, 7] regard 
text regions as textured objects. Although very effective in text extraction, the texture- 
based methods have some shortcomings: difficulties in manually designing a texture 
classifier for various text conditions, locality of the texture information, and 
expensive computation in the texture classification stage. 

This paper proposes an approach that addresses several difficulties of text
extraction in complex color images, and presents its positive results. Based on the
sequential hybrid merging methodology of the texture and CC-based methods, we aim
to increase the recall rate with multi-layer perceptrons (MLPs) and the precision rate with
CC analysis (Figure 1). MLPs automatically generate a texture classifier that
discriminates between text regions and non-text regions on three color bands. Unlike
other neural network-based text extraction methods, the MLPs receive raw image
pixels as an input feature. Therefore, complex feature designing and feature extraction
can be avoided. The bootstrap method is used to force the MLPs to learn a precise
boundary between text and non-text classes, which results in reliable detection
performance for various conditions of texts in real scenes.

Although we use the bootstrap method, the detection results from the MLPs 
include many false alarms. In order to tackle this shortcoming of the texture-based 
method, we use CC-based filtering with an NMF technique: we apply three
stages of filtering to the CCs, using features of CCs such as area, fill factor, and horizontal
and vertical extents; geometric alignment of text components; and part-based shape
information using NMF.




Fig. 1. Overall Structure. 



For the last step, we use two different region-marking approaches according to the 
given image types (video image and document image) both to enhance time 
performance and to get better results. For video images, we use CAMShift, and for
document images, we use the X-Y recursive cut algorithm.



2 Neural Networks for Text Detection 

We use MLPs to make a texture classifier that discriminates between text pixels and
non-text ones. Readers are referred to the author’s previous publication [7] for details.
An input image is scanned by the MLPs, which receive the color values of a given pixel and its
neighbors within a small window for each Red, Green, and Blue color band. The three
MLPs’ outputs are combined into a text probability image (TPI), where each pixel’s
value is in the range [0,1] and represents the probability that the corresponding input
pixel is a part of text. A pixel whose value is larger than a given threshold value is
considered a text pixel.

Figure 2 shows the structure of the network for text location. A neural network-
based arbitration method is used to describe the influence of each color band. An 
arbitration neural network produces a final result based on the outputs of the three 
neural networks. After the pattern passes the network, the value of the output node is 
compared with a threshold value and the class of each pixel is determined. As a result 
of this classification, a classified image is obtained. 

[Figure 2 diagram; recoverable labels: "Location of Interest", "Placing Bounding Boxes"]

Fig. 2. Architecture of a discrimination network

When using an example-based learning approach, it would be desirable to make
the training set as large as possible in order to attain a comprehensive sampling of the
input space. However, when considering real-world limitations, the size of the
training set should be moderate. Then the problem is how to build a comprehensive
but tractable database. For text samples, we simply collect all text images we can find.
However, collecting non-text samples is more difficult, because a practically infinite
number of images can serve as valid non-text samples. To handle this problem we use the
bootstrap method recommended by Sung and Poggio [9], which was initially developed to
train neural networks for face detection. Some non-text samples are collected and used for
training. Then, the partially trained system is repeatedly applied to images that do
not contain text, and patterns with a positive output are added to the training set
as non-text samples. This process iterates until no more patterns are added to the
training set. This bootstrap method is used to get non-text training samples more
efficiently and to force the MLP to learn a precise boundary between text and non-
text classes.
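
The bootstrap loop can be sketched generically in Python. The sketch is not tied to MLPs: train stands for any hypothetical routine that returns a classifier, and the one-dimensional threshold trainer used in the example is a toy stand-in chosen only so that the loop can be run.

```python
def bootstrap_negatives(train, text_samples, seed_negatives, text_free_patterns,
                        max_rounds=20):
    """Grow the non-text training set with false alarms until none remain."""
    negatives = list(seed_negatives)
    classify = train(text_samples, negatives)
    for _ in range(max_rounds):
        # Patterns from text-free images that are wrongly classified as text.
        false_alarms = [p for p in text_free_patterns
                        if classify(p) and p not in negatives]
        if not false_alarms:
            break  # no more patterns are added to the training set
        negatives.extend(false_alarms)
        classify = train(text_samples, negatives)
    return classify, negatives


def train_threshold(positives, negatives):
    """Toy trainer (assumption): a threshold halfway between the two classes."""
    threshold = (min(positives) + max(negatives)) / 2
    return lambda x: x > threshold


classify, negatives = bootstrap_negatives(
    train_threshold,
    text_samples=[0.8, 0.9],
    seed_negatives=[0.1],
    text_free_patterns=[0.2, 0.5, 0.7],
)
print(negatives)
```

Only the hard negatives, those near the decision boundary, end up in the training set, which is what keeps the database tractable.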



3 Part-Based Component Filtering 

Although we use the bootstrap method to make the texture classification MLPs learn
a precise boundary between the text class and the non-text one, the detection result from the
MLPs includes many false alarms because we want the MLPs to detect as many texts
as possible. In order to tackle this shortcoming of the texture-based method, heuristics
such as size and aspect ratio of detected regions have been used to filter non-text regions
in previous research [2,4,6]. However, we still have problems in filtering out high-
frequency and high-contrast non-text regions. This is mainly because of the locality of
the texture information.

To perform component filtering, each text region output by the MLPs is
first passed through a 5x5 median filter, which eliminates spurious text pixels. We
then quantize the color of the input image pixels, corresponding to the text regions, 
into 512 levels (3 bits per color). The different colors present in the image are then 
grouped together using the single-link clustering algorithm to form clusters of similar 
colors. To start with, the clustering algorithm defines each color to be in a group of its 
own. At each step, the single-link clustering algorithm combines the two nearest 
clusters until the distance between them is above a threshold. The distance $d(c_1, c_2)$
between two colors $c_1 = (r_1, g_1, b_1)$ and $c_2 = (r_2, g_2, b_2)$ is defined to be
$d(c_1, c_2) = |r_1 - r_2| + |g_1 - g_2| + |b_1 - b_2|$. To efficiently compute the distance
between two clusters, we replaced each cluster with the more populous color, while combining two colors.
The threshold for terminating the clustering was experimentally determined to be 4. A 
typical color image in our experiment generates 9 to 15 clusters. Each pixel in text 
regions of the input image (output of the MLP) is then replaced by the representative 
color of the group it belongs to. CCs are identified from this color image. The 
morphological operation of closing (dilation followed by erosion) is applied to the 
CCs before we compute the attributes of the components such as size and area. We 
have then applied three-stage filtering on the CCs: 

Stage 1: Heuristics using features of CCs such as area, fill factor, and horizontal and
vertical extents. The width of the text region must be larger than Min_width, the
aspect ratio of the text region must be smaller than Max_aspect, the area of the
component must be larger than Min_area and smaller than Max_area, and the fill
factor must be larger than Min_fillfactor.

Stage 2: Geometric alignment of text components. We count the adjacent
text components that have the same color and lie in the same text line. Text components
must be aligned in runs of more than three consecutive components.

Stage 3: Part-based shape information. It is somewhat difficult to determine precisely
the heuristics used in Stage 1 and Stage 2. These heuristics have been
used successfully in a variety of text detection methods; however, they are limited in
their ability to detect text in images containing multi-segment characters such as
Korean. To represent the shape information of the characters efficiently, a part-based
NMF technique is used.
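The color grouping step that precedes the three filtering stages can be illustrated with a minimal Python sketch. It assumes 8-bit RGB pixels, measures the L1 distance in quantized units, and compares cluster populations when choosing the surviving representative color; the function names (`quantize`, `cluster_colors`) are hypothetical and not taken from the paper.

```python
from collections import Counter

def quantize(pixel, bits=3):
    """Keep the top `bits` bits of each 8-bit channel (8 levels per channel)."""
    shift = 8 - bits
    return tuple(channel >> shift for channel in pixel)

def l1_distance(c1, c2):
    """d(c1, c2) = |r1 - r2| + |g1 - g2| + |b1 - b2|."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

def cluster_colors(pixels, threshold=4, bits=3):
    """Merge the two nearest clusters until their distance exceeds `threshold`.

    Returns a dict mapping every quantized color to its representative color.
    """
    counts = Counter(quantize(p, bits) for p in pixels)
    # representative color -> (population, member colors)
    clusters = {c: (n, {c}) for c, n in counts.items()}
    while len(clusters) > 1:
        reps = list(clusters)
        dist, a, b = min(
            (l1_distance(x, y), x, y)
            for i, x in enumerate(reps)
            for y in reps[i + 1:]
        )
        if dist > threshold:
            break
        # Assumption: the more populous cluster's color represents the merged cluster.
        keep, drop = (a, b) if clusters[a][0] >= clusters[b][0] else (b, a)
        n_drop, members_drop = clusters.pop(drop)
        n_keep, members_keep = clusters[keep]
        clusters[keep] = (n_keep + n_drop, members_keep | members_drop)
    return {m: rep for rep, (_, members) in clusters.items() for m in members}
```

Each text pixel would then be replaced by `mapping[quantize(pixel)]` before connected components are extracted.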

The NMF algorithm, devised by Lee and Seung [11], decomposes a given matrix
into a basis set and encodings, while satisfying the non-negativity constraints. Given
an n x m original matrix V, in which each of the m columns is an n-dimensional
non-negative vector, NMF decomposes V into an n x r basis W and r x m
encodings H in order to approximate the original matrix. The NMF algorithm
thus constructs approximate factorizations as in Eq. (1).

$$V_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu} \qquad (1)$$

$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right] \qquad (2)$$

$$P(V \mid WH) = \exp(-WH) \, \frac{(WH)^{V}}{V!} \qquad (3)$$

$$\log P(V \mid WH) = V \log(WH) - WH - \log V! \qquad (4)$$

where the rank r of the factorization is generally chosen so that (n + m)r < nm.

The NMF algorithm starts from a random initialization of W and H and
iteratively updates them until they converge. Its multiplicative update rules
can be regarded as a special variant of gradient-descent algorithms [12].
The algorithm iteratively updates W and H by the following
multiplicative rules:

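$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} \qquad (5)$$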
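As an illustration of the multiplicative rules, the following NumPy sketch applies the divergence-based updates published by Lee and Seung: a W update, a column normalization of W, and an H update. The function name `nmf`, the fixed iteration count, and the small constant `eps` that guards against division by zero are assumptions of this sketch rather than details given in the paper.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative n x m matrix V into W (n x r) and H (r x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps   # random non-negative initialization
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= (V / WH) @ H.T                  # W_ia <- W_ia * sum_mu V/(WH) * H_a_mu
        W /= W.sum(axis=0, keepdims=True)    # normalize each basis column to sum 1
        WH = W @ H + eps
        H *= W.T @ (V / WH)                  # H_a_mu <- H_a_mu * sum_i W_ia * V/(WH)
    return W, H
```

For 28 x 28 character images, each image would be flattened into one 784-dimensional column of V; the columns of W are then the basis images.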



To establish the basis and encodings, 4992 character images of six different
typefaces were used (Figure 3). Each character image is
composed of n = 28 x 28 pixels, and the intensity values of all pixels are
normalized to the range [0, 1]. As can be seen from Figure 3, the NMF basis and
encodings contain a large fraction of vanishing coefficients, so both the basis images
and the image encodings are sparse.







Fig. 3. Basis Images for Text Components. 



In Stage 3, a component image obtained from a document image is projected
into the NMF space using the basis obtained in the training phase, resulting
in a new feature vector for the component. The new feature vector is
compared with the prototypes and then classified as text or non-text.



4 Region Marking 

After neural network and NMF analysis, we apply one of two region marking
approaches, depending on the image type. Although much work has been
done on text detection using texture and CC analysis, post-processing such as region
marking and text extraction has rarely been addressed in previous research. For gray or
color document images, we perform an X-Y recursive cut, assuming that skew
correction has been done in advance. For video images, which have lower text presence
rates than document images, we use the CAMShift algorithm to reduce processing
time.

4.1 X-Y Recursive Cut 

For document images, which have a larger text portion than non-text portion, we perform
an X-Y recursive cut on the filtered image. It is a top-down recursive partitioning algorithm.
This algorithm takes as input a binary image, where the text pixels are black and the
non-text pixels are white (Figure 4). The algorithm projects the text pixels onto the x
and y axes and identifies the valleys in the projections. In this paper we assume the
document image is skew-corrected in advance. The algorithm recursively divides each
region by projecting it, alternately, onto the x and y axes (Figure 5).
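The recursive partitioning can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the input is a NumPy array in which non-zero entries mark text pixels, a "valley" is a run of at least `min_gap` empty rows or columns in the projection, and a region becomes a leaf once neither projection can split it. The names `xy_cut` and `min_gap` are hypothetical.

```python
import numpy as np

def _runs(profile, min_gap):
    """Return [start, end) intervals of occupied bins separated by valleys >= min_gap."""
    nonzero = np.flatnonzero(profile)
    if nonzero.size == 0:
        return []
    runs, start, prev = [], nonzero[0], nonzero[0]
    for idx in nonzero[1:]:
        if idx - prev - 1 >= min_gap:
            runs.append((int(start), int(prev) + 1))
            start = idx
        prev = idx
    runs.append((int(start), int(prev) + 1))
    return runs

def xy_cut(img, min_gap=2, top=0, left=0, axis=0, other_axis_failed=False):
    """Return bounding boxes (top, left, bottom, right) of the leaf regions.

    axis=0 cuts into horizontal bands (projection onto the y axis);
    axis=1 cuts into vertical strips (projection onto the x axis).
    """
    profile = img.sum(axis=1) if axis == 0 else img.sum(axis=0)
    runs = _runs(profile, min_gap)
    boxes = []
    for start, end in runs:
        if axis == 0:
            sub, t, l = img[start:end, :], top + start, left
        else:
            sub, t, l = img[:, start:end], top, left + start
        if len(runs) == 1 and other_axis_failed:
            # Neither projection has a valley: this region is a leaf.
            boxes.append((t, l, t + sub.shape[0], l + sub.shape[1]))
        else:
            boxes += xy_cut(sub, min_gap, t, l, 1 - axis, len(runs) == 1)
    return boxes
```

Choosing `min_gap` larger than the spacing between characters but smaller than the spacing between lines or columns keeps words together while still separating blocks.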

4.2 CAMShift Algorithm 

For video images, which contain less text than non-text area,
the problem with the texture-based method is the computational complexity of the
texture classification stage, which accounts for most of the processing time. In particular,
texture-based convolution-filtering methods have to scan an input image exhaustively
in order to detect the text regions [3, 4], and the convolution operation is
computationally expensive.

We use a continuously adaptive mean shift (CAMShift) algorithm for text
extraction in video images [10]. To avoid texture analysis of an entire image with the
MLP, CAMShift is invoked at several seed positions on the TPI. This
leads to great computational savings when text regions do not dominate the image. At
the initial stage of the CAMShift algorithm at each seed position, we decide whether
the seed position contains text using the zeroth moment (text detection), and then
enlarge the detected text regions by merging neighboring pixels within a search window in
successive iterations (text extraction).

During the iterations, we estimate the position and size of the text region using 2D
moments, change the search window position and size according to the estimated
values, and perform a node-merge operation to eliminate overlapping nodes. If the
mean shift is larger than either of the threshold values $\varepsilon_x$ (along the x-axis) and
$\varepsilon_y$ (along the y-axis), the iteration continues (see the authors' previous
publication [7] for details).
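The window iteration at a single seed can be sketched as follows. This is an illustrative sketch only: `prob` stands for a 2-D array of per-pixel text probabilities, the window is resized to five standard deviations of the mass plus a two-pixel margin (an assumed rule, since the sizing formula is not given here), and the node-merge step is omitted. All names and default values are hypothetical.

```python
import numpy as np

def camshift_seed(prob, cx, cy, w, h, eps_x=0.5, eps_y=0.5,
                  min_mass=1.0, max_iter=50):
    """Iterate a search window from one seed; return (cx, cy, w, h) or None."""
    rows, cols = prob.shape
    for _ in range(max_iter):
        x0 = max(int(round(cx - w / 2)), 0)
        x1 = min(int(round(cx + w / 2)), cols)
        y0 = max(int(round(cy - h / 2)), 0)
        y1 = min(int(round(cy + h / 2)), rows)
        win = prob[y0:y1, x0:x1]
        m00 = win.sum()                      # zeroth moment: text mass in the window
        if m00 == 0 or m00 < min_mass:
            return None                      # the seed does not contain text
        ys, xs = np.mgrid[y0:y1, x0:x1]
        new_cx = (xs * win).sum() / m00      # first moments give the centroid
        new_cy = (ys * win).sum() / m00
        sx = np.sqrt((((xs - new_cx) ** 2) * win).sum() / m00)
        sy = np.sqrt((((ys - new_cy) ** 2) * win).sum() / m00)
        w, h = 5 * sx + 2, 5 * sy + 2        # assumed sizing rule from second moments
        shift_x, shift_y = abs(new_cx - cx), abs(new_cy - cy)
        cx, cy = new_cx, new_cy
        if shift_x <= eps_x and shift_y <= eps_y:
            break                            # mean shift below both thresholds
    return float(cx), float(cy), float(w), float(h)
```

Because only pixels inside the current window are examined, the cost depends on the size of the text regions rather than on the size of the frame.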



5 Experimental Results 

The proposed text extraction method has been tested with several types of images:
captured broadcast news, scanned images, web images, and video clips downloaded
from the web site of the Movie Content Analysis (MoCA) Project [13].

Fig. 4. Intermediate results for X-Y recursive cut: (a) input color image, (b) TPI, (c) after CC- 
analysis, (d) color quantized image, and (e) text regions. 




Fig. 5. Transitions of intermediate text detection results. 



Among training protocols such as on-line, batch, and stochastic training, we
chose the batch mode. We used a constant learning rate of 0.02, a momentum of 0.5,
and the sigmoidal activation function f(net) = a tanh(b net) with a = 1.7 and b = 0.75. We
used the number of epochs as the stopping criterion: 1500 epochs were
sufficient for convergence in several experiments with varying input window sizes. To
check the performance of the MLP as a texture classifier, we summarize the
precision and recall rates at the pixel level. Without CAMShift, the MLP is
applied at every pixel in the given image.
Fig. 6. Extraction examples for color document images.

Fig. 7. CC analysis using NMF: (a) input images, (b) after texture-based text detection, (c) and
(d) CCs of (b), (e) binarization based on CCs, (f) CCs of (e), (g) after CC-based filtering, (h)
after NMF-based filtering.

Figure 6 shows examples of text region detection using the X-Y recursive cut. Some
relatively large text is missed in Figure 6(b), but such misses are probably not
important for the original purpose of text recognition for image indexing. Figure 7
shows example images before and after CC analysis based on NMF. Figure 7(c) is
the result of CC analysis after the texture-based text detection, and Figure 7(e) is a
binary image obtained from (c). Figures 7(f) and (g) show the image after CC analysis of (e), and
(h) is the result of NMF-based filtering. Non-text CCs that are hard to remove using
CC analysis alone are removed, and more robust and reliable results are obtained (when
printed in gray-scale, the final results are not clearly visible because of the dark and
complex background).

Table 1 shows the pixel-level precision and recall rates after marking bounding
boxes. It compares the proposed method with a modified CCM [4] and another texture-
based method [7]. The CCM [4] quantized the color space into a few prototypes and
performed a CC analysis on the resulting gray-scale image. Then, each extracted CC
was verified according to its area, ratio, and alignment. The texture-based method [7]
adopted an MLP to analyze the textural properties of text in order to identify the text
regions and used a top-down X-Y recursive cut technique on projection profiles to
generate bounding boxes. In terms of precision, the proposed method is clearly
superior to the other texture-based method and the CCM. As we generate the text
probabilities only for pixels within the search window, the average number of pixels
convolved by the MLP is reduced to about a tenth of that of an exhaustive search [7],
and so is the processing time. Moreover, with shifting*, the system achieves a processing
time shorter than 0.05 seconds per frame (with an interval size of 3 pixels for each column and row).
The processing times in Table 1 were measured on video frames of size 320x240.



Table 1. Comparison of precision and recall rates for video images

                  CC-only [6]   MLP-only [8]   MLP + CC + CAMShift
Time (seconds)    0.1           4.63           0.47
Precision (%)     78.8          73.1           95.7
Recall (%)        84.5          91.2           89.3



We tried to compare the performance of our method with that of others. However, there is no
comprehensive public database of video sequences containing captions, and each
researcher has used his or her own database, so it is hard to make a comprehensive, objective
comparison. In this experiment, we used sample video clips downloaded from the MoCA
project web site [13].



6 Conclusions 

In this paper, an efficient text extraction technique that combines neural
network-based detection with NMF-based filtering is presented. Text under
various conditions can be detected automatically using neural networks without any
explicit feature extraction stage. The main drawback of the texture-based method is
its locality property: it does not consider anything outside its window.
This motivates a hybrid of texture-based and CC-based methods:
we aim to increase the recall rate with the MLP, and then the precision rate with
NMF-filtering-based CC analysis. Moreover, by adopting CAMShift we do not need
to analyze the texture properties of an entire image, which improves
time performance. The proposed method works particularly well in extracting text
from complex and textured backgrounds and shows better performance than
other methods proposed in the open literature. For future work, an OCR
algorithm needs to be incorporated with this text extraction method. For this we have to consider
enhancing the resolution of the extracted characters and more flexible OCR
technology for low-resolution characters.

* Here, shifting means the technique that classifies pixels at regular intervals and interpolates
pixels located between the classified pixels.
Abstract. In places where many people gather, we may find a person who,
from a security viewpoint, is suspicious because he or she behaves differently from others.
In other words, a person who takes a peculiar action is suspicious. In
this paper, we describe an application of our peculiarity oriented mining
approach to analysing image sequences of tracking multiple walking
people. A measure of peculiarity, called the peculiarity factor, is
investigated theoretically. The usefulness of our approach is verified by
experimental results.



1 Introduction 

In places where many people gather, such as a station or an airport, a terrorist
attack would cause many casualties. From a security viewpoint,
we need to find suspicious persons in such places. Although a suspicious
person can be discovered by using a surveillance camera with video, not all
people can be checked automatically.

In this paper, we describe an application of our peculiarity oriented mining 
approach for analysing image sequences of tracking multiple walking people. We 
observed that a suspicious person usually takes action which is different from 
other people, so-called peculiar action, such as coming from and going to the 
same place and stopping at some place. Hence, our peculiarity oriented mining 
approach [8, 9] can be used to analyse such data automatically. Such peculiarity 
oriented analysis is a main difference between our approach and other related 
work on analysis of human movement [1, 4]. 

The rest of the paper is organized as follows. Section 2 investigates how to 
identify peculiar data in our peculiarity oriented mining approach. Section 3 
discusses the application of our approach for automatically analysing image se- 
quences of tracking multiple walking people, which were obtained by using the 
surveillance camera. The experimental results show the usefulness of our ap- 
proach. Finally, Section 4 gives concluding remarks. 

2 Peculiar Data Identification 

The main task of peculiarity oriented mining is the identification of peculiar
data. An attribute-oriented method, which analyzes data from a new viewpoint and
differs from traditional statistical methods, was recently proposed by Zhong
et al. [8, 9].

2.1 A Measure of Peculiarity 

Peculiar data are a subset of objects in the database and are characterized by 
two features: 

( 1 ) very different from other objects in a dataset, and 

(2) consisting of a relatively low number of objects. 

The first property is related to the notion of distance or dissimilarity of objects.
Intuitively speaking, an object is different from other objects if it is far away
from them according to a certain distance function. Its attribute values must
be different from the values of other objects. One can define the distance between
objects based on the distance between their values. The second property is related
to the notion of support. Peculiar data must have a low support.

At the attribute level, peculiar data can be identified by finding
attribute values that have properties (1) and (2). Table 1 shows a relation with
attributes $A_1, A_2, \ldots, A_m$. Let $x_{ij}$ be the value of $A_j$ in the i-th tuple, and n
the number of tuples. Zhong et al. [8, 9] suggested that the peculiarity of $x_{ij}$ can
be evaluated by a Peculiarity Factor, $PF(x_{ij})$,

$$PF(x_{ij}) = \sum_{k=1}^{n} N(x_{ij}, x_{kj})^{\alpha}, \qquad (1)$$

where N denotes the conceptual distance and $\alpha$ is a parameter denoting the
importance of the distance between $x_{ij}$ and $x_{kj}$, which can be adjusted by a user;
$\alpha = 0.5$ is the default.



Table 1. A sample table (relation)

A1     A2     ...   Aj     ...   Am
x11    x12    ...   x1j    ...   x1m
x21    x22    ...   x2j    ...   x2m
...    ...    ...   ...    ...   ...
xi1    xi2    ...   xij    ...   xim
...    ...    ...   ...    ...   ...
xn1    xn2    ...   xnj    ...   xnm


With the introduction of conceptual distance, Eq. (1) provides a more flexible
method to calculate the peculiarity of an attribute value. It can handle both
continuous and symbolic attributes based on a unified semantic interpretation.
Background knowledge represented by binary neighborhoods can be used to evaluate
the peculiarity if such background knowledge is provided by a user. If X
is a continuous attribute and no background knowledge is available, we use the
following distance:

$$N(x_{ij}, x_{kj}) = |x_{ij} - x_{kj}|. \qquad (2)$$

If X is a symbolic attribute and the background knowledge for representing the
conceptual distances between $x_{ij}$ and $x_{kj}$ is provided by a user, the peculiarity
factor is calculated from the conceptual distances [3, 6, 8, 9]. The conceptual
distances are set to 1 if no background knowledge is available.

Based on the peculiarity factor, the selection of peculiar data is simply carried
out by using a threshold value. More specifically, an attribute value is peculiar
if its peculiarity factor is above the minimum peculiarity p, namely, $PF(x_{ij}) > p$.
The threshold value p may be computed from the distribution of PF as follows:

$$p = \text{mean of } PF(x_{ij}) + \beta \times \text{standard deviation of } PF(x_{ij}) \qquad (3)$$

where $\beta$ can be adjusted by a user, and $\beta = 1$ is used as the default. The threshold
indicates that a datum is peculiar if its PF value is much larger than the
mean of the PF set. In other words, if $PF(x_{ij})$ is over the threshold value, $x_{ij}$
is a peculiar datum. By adjusting the parameter $\beta$, a user can control the
threshold value.
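For a continuous attribute without background knowledge, Eqs. (1) to (3) can be put together in a short NumPy sketch. The function names are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator is intended.

```python
import numpy as np

def peculiarity_factors(values, alpha=0.5):
    """PF(x_i) = sum_k |x_i - x_k|^alpha for one continuous attribute (Eqs. 1, 2)."""
    x = np.asarray(values, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])   # pairwise distances N(x_i, x_k)
    return (dist ** alpha).sum(axis=1)

def select_peculiar(values, alpha=0.5, beta=1.0):
    """Return (PF values, threshold p, boolean mask of peculiar values) per Eq. 3."""
    pf = peculiarity_factors(values, alpha)
    threshold = pf.mean() + beta * pf.std()  # assumption: population std (ddof=0)
    return pf, threshold, pf > threshold

# Example: one value lies far from the rest and is the only one selected.
pf, p, mask = select_peculiar([10, 11, 12, 11, 10, 12, 11, 50])
```

Raising `beta` makes the selection stricter, because fewer PF values exceed the mean by that many standard deviations.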



2.2 Analysis of the Peculiarity Factor 

A question that arises naturally is whether the proposed peculiarity factor reflects
our intuitive understanding of peculiarity (i.e. the properties (1) and (2)
mentioned previously). More specifically, does a high value of Eq. (1) indicate
that $x_{ij}$ occurs in a relatively low number of objects and is very different from
other data $x_{kj}$? Although many experimental results have shown the effectiveness
of the peculiarity factor, a detailed analysis may give us more insight [7].



Table 2. Attribute values and their frequencies

attribute value   frequency
x1                n1
x2                n2
...               ...
xh                nh
Total             n


In order to analyze Eq. (1), we adopt a distribution form of attribute values.
In Table 2, let $\{x_1, \ldots, x_h\}$ be the set of distinct values of an attribute.
With respect to the distribution, $PF(x_i)$ can be easily computed by:

$$PF(x_i) = \sum_{k=1}^{h} n_k \times N(x_i, x_k)^{\alpha}. \qquad (4)$$

Let us now consider two special cases in order to gain a better understanding of
PF.

Case 1-1. Assume that all attribute values have the same frequency, namely,
$n_1 = n_2 = \cdots = n_h = n/h$. In this case, we have:

$$PF(x_i) = \frac{n}{h} \sum_{k=1}^{h} N(x_i, x_k)^{\alpha}. \qquad (5)$$

Since n/h is a constant independent of any particular value, the PF value depends
only on the total distance between $x_i$ and the other values. A value far away from
other values would be considered to be peculiar.

Case 1-2. Assume that the distance between any pair of distinct values is
the same, namely, $N(x_i, x_k) = C$ for $i \neq k$ and $N(x_i, x_i) = 0$. In this case, we
have:

$$PF(x_i) = (n - n_i)C = nC - n_i C. \qquad (6)$$

Since nC is a constant independent of any particular value, the PF value is
monotonically decreasing with respect to the value of $n_i$. A value with low frequency
will have a large PF and, in turn, is considered to be peculiar. As expected, the
distances between $x_i$ and other values are irrelevant [7].

In general, PF depends on both the distribution and the individual distances
$N(x_i, x_k)$. Several qualitative properties of the peculiarity
factor can be stated based on Eq. (4):

- A value with low frequency tends to have a higher peculiarity value. This
follows from the fact that $\sum_{k=1}^{h} n_k = n$ and there are $n - n_i$ distances to be
added for $x_i$.

- Each term in Eq. (4) is a product of the frequency $n_k$ and the distance $N(x_i, x_k)$.
This suggests that a value far away from more frequent values is likely to be
peculiar. On the other hand, a value far away from less frequent values may
not necessarily be peculiar, due to the small value of $n_k$. A value closer to very
frequent values may also be considered to be peculiar, due to the large value
of $n_k$. These latter properties are not desirable.

- Eq. (4) can be rewritten as:

$$PF(x_i) = n \sum_{k=1}^{h} \frac{n_k}{n} \times N(x_i, x_k)^{\alpha}. \qquad (7)$$

Thus, up to the constant factor n, the peculiarity factor is a weighted average of
the distances between $x_i$ and other values. It is the expected value of
$N(x_i, x_k)^{\alpha}$ with respect to the probability distribution
$(n_1/n, n_2/n, \ldots, n_h/n)$. Under this view, a value
is deemed peculiar if it has a large expected distance to other values.
From the above analysis, we can conclude that the peculiarity factor has
some desired properties and some undesired properties. The main problem may
stem from the fact that an average is used in the calculation of the peculiarity factor.
A best average does not necessarily imply a best choice. Consider the following
distribution:



attribute value   frequency
x1 = 1            n1 = 10
x2 = 5            n2 = 1
x3 = 7            n3 = 10
Total             n = 21

Assume $\alpha = 1$ and $N(x_i, x_k) = |x_i - x_k|$. We have the following peculiarity
values:

$PF(x_1) = 64$, $PF(x_2) = 60$, $PF(x_3) = 62$.

Intuitively, however, $x_2 = 5$ seems to be more peculiar than the other two. Furthermore,
although a user can adjust the parameter $\beta$ in the selection of the threshold
value for peculiar data selection, its usefulness is limited. The notion of peculiarity,
as defined by Eq. (4), mixes together the two notions of frequency and
distance. Although it is based on a sound theoretical argument, its meaning cannot
be simply explained to a non-expert.
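The arithmetic behind this counterexample can be checked with a short Python sketch of Eq. (4). It takes the three attribute values as 1, 5 and 7 with frequencies 10, 1 and 10, which are the values for which Eq. (4) with $\alpha = 1$ yields the peculiarity values quoted above; the function name is hypothetical.

```python
def pf_from_distribution(values, freqs, alpha=1.0):
    """PF(x_i) = sum_k n_k * |x_i - x_k|^alpha over the distinct values (Eq. 4)."""
    return [
        sum(n_k * abs(x_i - x_k) ** alpha for x_k, n_k in zip(values, freqs))
        for x_i in values
    ]

# Distinct values 1, 5, 7 with frequencies 10, 1, 10 (n = 21).
print(pf_from_distribution([1, 5, 7], [10, 1, 10]))  # prints [64.0, 60.0, 62.0]
```

The rare value 5 receives the lowest PF because it sits between the two frequent values, so its frequency-weighted distances are all small.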

Furthermore, two special cases of $\alpha$ in Eq. (1) can be considered,
corresponding to the two cases stated above.

Case 2-1. Assume $\alpha \gg n$. This means $N(x_i, x_k)^{\alpha} \gg n_k$. Hence,

$$PF(x_i) = \sum_{k=1}^{h} n_k \times N(x_i, x_k)^{\alpha} \approx \sum_{k=1}^{h} N(x_i, x_k)^{\alpha}. \qquad (8)$$

This case is the same as Case 1-1 stated above. In other words, the PF value
depends only on the total distance between $x_i$ and the other values.

Case 2-2. Assume $\alpha \to 0$. Hence,

$$N(x_i, x_k)^{\alpha} \to 1 \quad \text{for } i \neq k. \qquad (9)$$

This case is the same as Case 1-2 stated above, when C = 1. Based on Eq. (6),
we have

$$PF(x_i) = (n - n_i)C = nC - n_i C = n - n_i. \qquad (10)$$

In other words, the PF value depends only on the distribution $n_k$.

Figure 1 shows the relationship between the distance and PF as $\alpha$ changes
in Eq. (1). By adjusting the parameter $\alpha$, a user can control the degree
to which PF depends on the distribution and on the distance. In our
experience, $\alpha = 0.5$ gives a good balance between the distribution and the
distance.
Fig. 1. The relationship between the distance and PF when $\alpha$ is changed



2.3 An Algorithm 

Based on the above preparation, an algorithm for finding peculiar data
can be outlined as follows:

Step 1. Execute attribute oriented clustering for each attribute, respectively.

Step 2. Calculate the peculiarity factor $PF(x_{ij})$ in Eq. (1) for all values in an
attribute.

Step 3. Calculate the threshold value in Eq. (3) based on the peculiarity factors
obtained in Step 2.

Step 4. Select the data that are over the threshold value as the peculiar data.

Step 5. If the current peculiarity level is enough, then go to Step 7.

Step 6. Remove the peculiar data from the attribute to obtain a new
dataset. Then go back to Step 2.

Step 7. Change the granularity of the peculiar data by using background knowledge
on information granularity if the background knowledge is available.

Furthermore, the algorithm can be run in a parallel-distributed mode for multiple
attributes, relations and databases because it is an attribute-oriented
finding method.

3 Application in Analysing Image Sequences of Tracking
Multiple Walking People

Peculiarity oriented mining has been applied to analyse image sequences of tracking
multiple walking people. The tracked path of each walking person
is used to discover that person's action pattern and to detect unusual
actions. The data were obtained by using a surveillance camera with video,
and preprocessed by using person tracking technology developed at OKI Electric
Industry Co., Ltd. In this experiment, we used video recorded at a station
ticket wicket (Fig. 2). The purpose is to discover persons who have taken
suspicious actions.




Fig. 2. Multiple walking people at a station ticket wicket 



3.1 Data Preparation 

The original attributes of the tracked image sequences of multiple walking people
might not be directly suitable for our peculiarity oriented mining approach.
Hence, a key issue is how to generate attributes that meet our needs from the
original data.

First, the person tracking technology converts the raw data into coordinates, given in CSV
format for every frame. Each instance is described
by the following attributes.

- ID: The unique ID attached to each person being tracked.

- FrameNumber: The frame number in the video.

- Status: One of the following tracking states.

0: Undefined (unstable state at the start of tracking),

1: Defined (stable tracking), and
2: Lost (out of tracking).

- X1 (coordinate): Raw data under tracking; the left edge is 0 and the right edge
is 256.

- Y1 (coordinate): Raw data under tracking; the top edge is 0 and the bottom edge is
240.

- X2 (coordinate): The smoothed data; the left edge is 0 and the right edge is 256.

- Y2 (coordinate): The smoothed data; the top edge is 0 and the bottom edge is 240.
In this experiment, we do not use the attributes Status, X1 and Y1, although
Status is used in the person tracking technology. Because X1 and Y1 are
unstable, the smoothed coordinates X2 and Y2 are used in their place.

The following attributes are used to specify a person's
action: ID, In (the direction from which the person enters the photography range),
Out (the direction in which the person leaves the photography range), and seg-n (the
segment number: the number of changes in the direction of travel between entering
and leaving the photography range). In and Out are calculated from the coordinates
at which a person appears for the first time and for the last time, respectively, and seg-n
is calculated from the number of line segments into which the path is divided.



3.2 Linearization of Walking Data 

Usually, a person acting with a goal goes straight from the current position to the
destination if no obstacle is in the way. In contrast, a person who is in an
unusual state, for example without a specific goal, will walk back and forth within
a certain range, make a detour, or stop.
Hence, whether the behavior of a person is usual or not can be analysed by
calculating the segment number (i.e. the number of direction changes) in the
linearized walking data of each person.

In order to calculate the segment number, the walking data first need to be
linearized. An algorithm for linearization is outlined in Fig. 3.
An example of the linearization is shown in Fig. 4.
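Since the flowchart of Fig. 3 is not reproduced in the text, the following Python sketch uses the Ramer-Douglas-Peucker polyline simplification as a stand-in for the linearization step; it is not necessarily the authors' procedure. The `tolerance` parameter plays the role of an error margin in pixels, and all names are hypothetical.

```python
import math

def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def linearize(track, tolerance=20.0):
    """Simplify a list of (x, y) points to the vertices of a polyline."""
    if len(track) < 3:
        return list(track)
    first, last = track[0], track[-1]
    dists = [_point_segment_distance(p, first, last) for p in track[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= tolerance:
        return [first, last]          # every point lies within the error margin
    split = worst + 1                 # index of the farthest point in `track`
    left = linearize(track[:split + 1], tolerance)
    right = linearize(track[split:], tolerance)
    return left[:-1] + right          # drop the duplicated split vertex

def segment_count(track, tolerance=20.0):
    """Number of straight segments in the linearized track (seg-n in this sketch)."""
    return max(len(linearize(track, tolerance)) - 1, 0)
```

A track that wanders by less than the tolerance stays a single segment, while a clear turn adds one segment per change of direction.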




Fig. 3. Linearization flowchart 
Fig. 4. An example of the linearization



3.3 Simulations 

Experiment 1 The following parameters were used to calculate PF: $\alpha = 0.5$,
$\beta = 1$. The linearization of walking data used a 20-point error margin.
In this experiment, 26 persons were judged as taking peculiar actions.
A part of the result is given in Table 3.



Table 3. A part of result 1

ID     In          Out        seg-n
2004   down        up         3
2010   up          left       2
2019   rightup     leftup     1
2039   rightup     leftdown   2
2175   down        leftdown   2
2270   up          left       1
2272   left down   up         2
2353   left down   center     1


By comparing the result shown in Table 3 with the actual video, we can
see that only 3 persons can be regarded as peculiar ones. The reason why the
result is not more exact is that the attributes used in this experiment may be
insufficient. Hence, it is necessary to add attributes that describe a person's action
pattern more specifically.

Experiment 2 Based on the above experiment, we added a new attribute,
frame-n (i.e. the number of frames during which a person stays in the photography range).
The parameters for calculating PF were set to $\alpha = 0.5$, $\beta = 2$. The reason
why we set $\beta = 2$ is that the points through which persons pass in the photography
range vary widely. For example, some people pass through the middle
and others pass along an edge of the photography range.

The result of this experiment is shown in Table 4. Five
persons are judged as taking peculiar actions in this experiment. However, the value
"center" for the attributes In or Out means that the person was already in the photography
range at the start of recording, or still in it at the end. Hence, such a person
cannot be judged to have taken a peculiar action. As a result, only 3 persons have been
considered as taking peculiar actions.



Table 4. Result 2

ID     In       Out      seg-n   frame-n
2004   down     up       3       286
3020   left     up       3       286
4004   center   up       2       286
5024   right    up       2       279
5171   down     center   2       275




Experiment 3 Based on experiments 1 and 2, we added two more attributes,
speed (pixel/frame-n) and angle (the average angle of all turnings, i.e., the total
angle divided by the number of turnings), for experiment 3. The angle is calculated from the
linearized segments and is set to 180 degrees if there is no turning. Furthermore, the
parameters for calculating PF were set to $\alpha = 0.5$, $\beta = 1$ for the attribute speed,
and to $\alpha = 0.5$, $\beta = 2$ for the attribute angle, respectively.
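The two added attributes can be sketched in Python as follows. The sketch assumes that the angle at a turning is the interior angle between the two adjoining linearized segments (so a straight path corresponds to 180 degrees) and that speed is the path length in pixels divided by the number of frames; both readings are assumptions, and the function names are hypothetical.

```python
import math

def mean_turn_angle(vertices):
    """Average interior angle (degrees) at the turnings of a linearized path.

    Returns 180.0 when the path has no turning.
    """
    angles = []
    for a, b, c in zip(vertices, vertices[1:], vertices[2:]):
        v1 = (a[0] - b[0], a[1] - b[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        if n1 == 0 or n2 == 0:
            continue  # skip degenerate (zero-length) segments
        cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    return sum(angles) / len(angles) if angles else 180.0

def mean_speed(track, frame_n):
    """Path length in pixels divided by the number of frames in range."""
    length = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(track, track[1:])
    )
    return length / frame_n
```

With these definitions, small angle values indicate sharp changes of direction and low speed values indicate lingering in the photography range.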

As shown in Table 5, 16 persons, including
the case shown in Fig. 4, have been judged as taking peculiar actions.



Table 5. A part of result 3

ID     In       Out         seg-n   frame-n   speed   angle
2004   down     up          3       286       1.0     119
2010   middle   left        2       105       1.7     128
3020   left     left up     2       286       1.0     136
3031   up       left down   3       237       1.1     120
5134   down     rightup     3       223       1.2     134


Although not all the detected persons are peculiar ones, all suspicious persons
have been discovered. We observed that the overall pattern of the flow of people
changes depending on the time zone. Hence, it is necessary to compare
the detected peculiar actions with a more general pattern in each time zone.
4 Conclusions

We presented an application of our peculiarity oriented mining approach to
analysing image sequences of tracking multiple walking people. The strength
and usefulness of our approach have been investigated theoretically and demonstrated
by experimental results.

In order to increase accuracy, it is necessary to evaluate in multiple stages.
Moreover, a general rule (i.e. the overall pattern of the flow of people) needs to
be discovered, and will be helpful for recognizing suspicious persons quickly.
Abstract. For discovering hidden (latent) variables in real-world, non-
gaussian data streams or an n-dimensional cloud of data points, SVD
suffers from its orthogonality constraint. Our proposed method, "AutoSplit",
finds features which are mutually independent and is able to
discover non-orthogonal features. Thus, it (a) finds more meaningful hidden
variables and features, (b) can easily lead to clustering and segmentation,
(c) surprisingly scales linearly with the database size and
(d) can also operate in on-line, single-pass mode. We also propose
"Clustering-AutoSplit", which extends the feature discovery to multiple
feature/bases sets, and leads to clean clustering. Experiments on multiple,
real-world data sets show that our method meets all the properties
above, outperforming the state-of-the-art SVD.



1 Introduction and Related Work 

In this paper, we focus on discovering patterns in (multiple) data streams like 
stock-price streams and continuous sensor measurement, and multimedia objects 
such as images in a video stream. Discovery of the essential patterns in data 
streams is useful in this area, for it could lead to good compression, segmentation, 
and prediction. We shall put the related work into two groups: dimensionality 
reduction and streaming data processing. 

* Supported in part by Japan-U.S. Cooperative Science Program of JSPS; grants
from JSPS and MEXT (#15017207, #15300027); the NSF No. IRI-9817496,
IIS-9988876, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547,
SENSOR-0329549, EF-0331657; the Pennsylvania Infrastructure Technology
Alliance No. 22-901-0001; DARPA No. N66001-00-1-8936; and donations from Intel
and Northrop-Grumman.

H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 519-528, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

Dimensionality reduction/feature extraction. Given a cloud of n points,
each with m attributes, we would like to represent the data with fewer
attributes/features but still retain most of the information. The standard way
of doing this dimensionality reduction is through SVD (Singular Value
Decomposition). SVD finds the best set of axes to project the cloud of points,
so that the sum of squares of the projection errors is minimized (Figure
1(a)). SVD has been used in multiple settings: for text retrieval [1], under
the name of Latent Semantic Indexing (LSI); for face matching in the eigenface
project [2]; for pattern analysis under the name of Karhunen-Loeve transform
[3] and PCA [4]; for rule discovery [5]; and recently for streams [6] and
online applications [7]. Approximate SVD answers that allow a timely response
in online applications have also been proposed [8,9,10,11].
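
The projection property of SVD described above can be illustrated with a minimal NumPy sketch (not code from the paper). It projects a zero-mean point cloud onto its top-k singular directions and compares the squared projection error with the sum of the discarded squared singular values. The function name `svd_project`, the synthetic data and the choice k=2 are illustrative assumptions.

```python
import numpy as np

def svd_project(X, k):
    """Project the zero-mean rows of X onto the top-k right singular vectors."""
    X0 = X - X.mean(axis=0)                      # center every attribute
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    X_hat = (U[:, :k] * s[:k]) @ Vt[:k]          # rank-k reconstruction
    err = float(np.sum((X0 - X_hat) ** 2))       # sum of squared projection errors
    return X_hat, err, s

# Hypothetical data: 200 points with 5 correlated attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_hat, err, s = svd_project(X, k=2)
discarded = float(np.sum(s[2:] ** 2))            # energy of the dropped axes
```

Here `err` and `discarded` agree up to floating-point error, which is the sense in which the SVD axes are the "best" ones under squared error.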

Streaming data processing. Finding hidden variables is useful in time series 
indexing and mining [12], modeling [13,14], forecasting [15] and similarity search 
[16,17]. Fast, approximate indexing methods [18,19] have attracted much atten- 
tion recently. Automatic discovery of “hidden variables” in, for example, object 
motions would enable much better extrapolations, and help the human analysts 
understand the motion patterns. 





[Figure 1 panels: (a) (Energy) Right knee vs left knee; (b) Hidden variables h_1 and h_2; (c) Take off; (d) Landing]

Fig. 1. (AutoSplit versus SVD/PCA: “Broad jumps”) (a): the right knee energy
versus left knee energy during the jumps: take off (c) and landing (d). Two
jumps are performed at time ticks 100 and 380. In (a), the (Batch-)AutoSplit
vector b_1, slope 1:1, corresponds to “landing”; b_2, slope -1:60, to “take
off”. (b): the hidden variables h_1 (top) and h_2 (bottom) of b_1 and b_2,
respectively.



Although popularly used, SVD suffers from its orthogonality requirement on
real-world data whose distribution is not Gaussian. For example, in Figure
1(a), SVD proposes the two dashed orthogonal vectors as its basis vectors,
while completely missing the “natural” ones (b_1, b_2). Is there a way to
automatically find the basis vectors b_1 and b_2? Generally, we would like to
have a method which (a) finds meaningful feature vectors and hidden variables
(ones that better coincide with the true unknown variables which generate the
observed data streams), (b) can work in an unsupervised fashion, (c) scales
linearly with the database size, and (d) is able to operate in on-line,
single-pass mode (to cope with continuous, unlimited data streams).
Experiments on multiple real data sets from diverse settings (motion capture
data for computer animation, stock prices, video frames) show that the
proposed AutoSplit method and its variants achieve the above properties.

2 Proposed Method 

There are two major concepts behind AutoSplit: the basis vectors and the
hidden variables. The basis vectors are the analog of the eigenvectors of SVD,
while the hidden variables are the sources/variables controlling the
composition weights of these basis vectors when generating the observed data.
We use the “broad jumps” data set of Section 1 to illustrate these concepts.

Basis vectors and hidden variables. The “broad jumps” data set (Figure 1) is
a motion capture data set from [20]. The actor performed two broad jumps
during the recording period. Our data set is an n-by-m data matrix X=[x_ij]
with n=550 rows (time-ticks) and m=2 columns (left and right knee energy).
Figure 1(a) shows the scatter plot of the data points: x_i,1 versus x_i,2, or,
informally, right-knee(i) versus left-knee(i), for time ticks i = 1, ..., n.
For visualization purposes only, data points at successive time-ticks are
connected with lines; neither SVD nor AutoSplit uses this sequencing
information.

For the “broad jumps” data, we would like to (a) discover the structures of
the action (take-off, landing), and (b) partition the action sequences into
homogeneous segments. Notice that the majority of points are close to the
origin, corresponding to the times when the knees are not moving much (idle
and “flying”). However, there are two pronounced directions, one along the 45°
line, and one almost vertical. As shown, SVD fails to spot either of the two
pronounced directions. AutoSplit, on the other hand, clearly locks on to the
two “interesting” directions, without supervision.

(Observation 1) Playing the animation and keeping track of the frame numbers
(= time-ticks), we found that the points along the 45° line (b_1) are from the
landing stage of the action (both knees exert equal energy, Figure 1(d)),
while the ones on the near-vertical AutoSplit basis (b_2) are from the
take-off stage (only the right knee is used, Figure 1(c)).

Exactly because AutoSplit finds “good” basis vectors, most of the data points
lie along the captured major axes of activity. Thus, the encoding coefficients
(hidden variables) of any data point are all close to zero, except for the one
which controls the axis on which the data point lies (Figure 1(b)). Inspecting
the data, we found that h_1 controls the landing, while h_2 controls the take
off. The distinct “fire-up” periods of the hidden variables could lead to
clean segmentation.



2.1 AutoSplit: Definitions and Discussion 

As shown in Figure 1(a), the failure of SVD partly comes from the
orthogonality constraint on its basis vectors. Since our goal is to find clean
features, instead of constraining the basis vectors to be orthogonal, we
search for basis vectors which maximize the mutual independence among the
hidden variables. The idea is that independence implies uncorrelatedness,
which in turn yields clean, non-mixed features.

(Definition 1: Batch-AutoSplit) Let X[n×m] be the data matrix representing n
data points (rows) with m attributes (columns). We decompose X[n×m] as a
linear mixture of l bases (rows of the basis matrix B[l×m]), with weights in
the columns of the hidden matrix H[n×l] (l: the number of hidden variables,
l ≤ m):

$$X_{[n \times m]} \approx H_{[n \times l]} \, B_{[l \times m]}$$

$$
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
\approx
\begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1l} \\
h_{21} & h_{22} & \cdots & h_{2l} \\
\vdots & \vdots &        & \vdots \\
h_{n1} & h_{n2} & \cdots & h_{nl}
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_l \end{bmatrix}
$$

where each x_i and each b_k is a row vector.


We model each data point x_i as a linear combination of the basis vectors b_k
(features). The value b_kj indicates the weight of the j-th attribute for the
k-th hidden variable, where hidden variables represent the unknown
data-generating factors (e.g., the economic events behind a share price
series). In the “broad jumps” example, there are n=550 data points x_i, each
m=2 dimensional. The l=2 basis vectors are 2-d vectors, with values
b_1=(15.51, 14.12) (≈ 45° line) and b_2=(-0.29, 17.65) (near-vertical line).
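
To make the roles of the basis vectors and hidden variables concrete, here is a minimal sketch (not the authors' code) that takes the two basis vectors reported above as the rows of B and solves X = H B for H. The four data points, the threshold value and the function name `hidden_variables` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Basis vectors reported for the "broad jumps" data (rows of B).
B = np.array([[15.51, 14.12],     # b_1: roughly the 45-degree line ("landing")
              [-0.29, 17.65]])    # b_2: near-vertical line ("take off")

def hidden_variables(X, B):
    """Solve X = H B for H when B is square and invertible."""
    return X @ np.linalg.inv(B)

# Hypothetical points: two along b_1, one along b_2, one idle point near the origin.
X = np.array([2.0 * B[0], 0.5 * B[0], 3.0 * B[1], [0.01, -0.02]])
H = hidden_variables(X, B)

# Segment each time tick by the hidden variable that "fires"; -1 marks idle ticks.
threshold = 0.1                   # hypothetical cut-off
strongest = np.abs(H).argmax(axis=1)
labels = np.where(np.abs(H).max(axis=1) > threshold, strongest, -1)
```

Each point that lies on one basis vector has a single non-zero hidden variable, which is what makes the "fire-up" periods usable for segmentation.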

Figure 2 gives the outline of Batch-AutoSplit. The analysis behind Batch- 
AutoSplit is based on ICA (Independent Component Analysis) [21]. There is 
a large literature on ICA, with several alternative algorithms. The one most 
related to Batch-AutoSplit is [22]. 

Batch-AutoSplit assumes the number of hidden variables is the same as the
number of attributes, i.e., B is a square matrix. The number of hidden
variables, l, is controlled by the whitening step of the algorithm. The
whitening of a matrix A is obtained by first making A zero-mean (which gives
A_0), and then applying SVD to get A_0 = U Σ V^T. The whitening result is
$\tilde{A}$ = U_l, where U_l is the first l columns of U. Also, the Frobenius
norm of the matrix A[n×m] is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2}$$

– Given: The data matrix X, and a randomly initialized B.

– Step 1: Whiten the data matrix X, and get $\tilde{X}$. Initialize ΔB such
  that ||εΔB||_F > δ, where ε controls the size of each gradient step and is
  usually reduced as more iterations are done, and δ is some pre-defined
  threshold.

– Step 2: While ||εΔB||_F > δ

   • Step 2.1: Compute H = $\tilde{X}$ B^{-1}.

   • Step 2.2: Compute the gradient ΔB = -B^T Z^T H - n B^T, where Z = -sign(H).

   • Step 2.3: Update B = B + εΔB.



Fig. 2. Batch-AutoSplit algorithm: B_out = Batch-AutoSplit(X, B)
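
The whitening step and the stopping test of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `whiten`, `frobenius_norm` and `should_continue` and the random data are hypothetical, and the gradient update itself is deliberately left out.

```python
import numpy as np

def whiten(A, l):
    """Zero-mean the columns of A, take the SVD, keep the first l columns of U."""
    A0 = A - A.mean(axis=0)
    U, s, Vt = np.linalg.svd(A0, full_matrices=False)
    return U[:, :l]

def frobenius_norm(A):
    """Square root of the sum of squared entries of A."""
    return float(np.sqrt(np.sum(np.asarray(A, dtype=float) ** 2)))

def should_continue(eps, dB, delta):
    """Loop condition of Step 2: keep iterating while ||eps * dB||_F > delta."""
    return frobenius_norm(eps * dB) > delta

# Hypothetical data: 100 points with 4 attributes, reduced to l = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X_w = whiten(X, l=2)
```

The columns of `X_w` are orthonormal, so the whitened data has uncorrelated, equally scaled coordinates, and l directly sets the number of hidden variables that the square matrix B can describe.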

2.2 Proposed Method: AutoSplit and Clustering-AutoSplit

Figure 3 extends the basic Batch-AutoSplit algorithm to process online data
streams. AutoSplit takes in an infinite stream of data items (grouped into
windows X_0, X_1, ...) and continuously outputs the estimated bases B_0, B_1,
.... The memory requirement of the AutoSplit algorithm is tunable, by setting
the number of data items (n) to be processed at each loop iteration. The
computation time at each update is constant (O(1)), given a fixed number of
data points at each update. In fact, the actual time for each update is tiny,
since only a couple of small matrix multiplications and additions are
required. Note that AutoSplit is capable of adapting to gradual changes of the
hidden variables, which is desirable for long-term monitoring applications,
where the underlying hidden variables may change over time.



– Given: An infinite stream of data items; every n items are grouped as a data
  batch X_l (l = 1, 2, ...). Data batches may overlap.

– Step 1: When X_0 is available, initialize B_0 randomly. Let l = 0.

– Step 2: While X_{l+1} is not available, do

   • B_l = Batch-AutoSplit(X_l, B_l).

– Step 3: B_{l+1} = B_l. Go to Step 2.



Fig. 3. AutoSplit algorithm: (B_0, B_1, ...) = AutoSplit(X_0, X_1, ...)
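
The control flow of Figure 3 can be sketched as a Python generator. This is an illustration under assumptions rather than the authors' code: the batch routine is passed in as a hypothetical callable `batch_update(X, B)`, and the "while the next window is not available" loop is replaced by a fixed number of `refinements` per window.

```python
import numpy as np

def autosplit_stream(windows, batch_update, B0, refinements=1):
    """Yield one basis estimate per window, warm-starting from the previous one."""
    B = B0
    for X in windows:
        for _ in range(refinements):   # stands in for "while X_{l+1} is not available"
            B = batch_update(X, B)     # plays the role of Batch-AutoSplit(X_l, B_l)
        yield B                        # the next window starts from this estimate

# Toy update rule, used only to make the warm start visible.
toy_update = lambda X, B: B + X.mean()
windows = [np.ones((4, 2)), 2.0 * np.ones((4, 2))]
estimates = list(autosplit_stream(windows, toy_update, np.zeros((2, 2))))
```

Only the current window and the current bases are held in memory, which is why the memory requirement is set by the window size n.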



Real-world data is not always generated from a single distribution; often it
comes from a mixture of distributions. Many studies model this mixture by a
set of Gaussian distributions. Since real-world data is often non-Gaussian, we
propose “Clustering-AutoSplit”, which fits data as a mixture of AutoSplit
bases.




Fig. 4. Clustering-AutoSplit. The number of clusters is set to k = 2. Each set
of AutoSplit bases is centered at the mean of the data points in a cluster.



Figure 4 shows a synthetic example of 2 clusters on the 2-dimensional plane.
Each cluster is specified by a set of 2 bases. Note that by fitting the data
into the mixture model, the data items are automatically clustered into
different classes. Figure 5 gives the outline of the Clustering-AutoSplit
algorithm. Note that we can also use the AutoSplit algorithm, instead of
Batch-AutoSplit, in Step 2.5 of the algorithm, yielding an online clustering
algorithm.



– Given: k is the number of clusters to be found, and X has the data items.

– Step 1: Initialize B_j’s and c_j’s randomly, j = 1, ..., k. B_j and c_j are
  the bases and the mean of the j-th data cluster, respectively.

– Step 2: While the changes on c_j’s remain large (above some threshold δ),

   • Step 2.1: For each data point x_i, compute its likelihood of belonging to
     the j-th cluster,

     $$f_{ij} = \frac{\prod_{r=1}^{l} f_h(h_{ir})}{|\det(B_j)|},$$

     where $x_i = \sum_{r=1}^{l} h_{ir} b_{jr}$, l is the number of bases for
     cluster j, and $f_h(h_{ir}) \propto \exp(-|h_{ir}|)$.

   • Step 2.2: Compute the relative weight of x_i to cluster j,
     $p_{ij} = f_{ij} / \sum_{k=1}^{n} f_{kj}$.

   • Step 2.3: Update c_j’s: c_j =

   • Step 2.4: Cluster each point x_i to the cluster of maximum likelihood.
     Let the k clusters be C_1, ..., C_k.

   • Step 2.5: For each cluster j, update B_j = Batch-AutoSplit(C_j, B_j)
     (Figure 2).



Fig. 5. Clustering-AutoSplit: (B_1, c_1, ..., B_k, c_k) = Clustering-AutoSplit(k, X)
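
Steps 2.1 and 2.4 of Figure 5 can be sketched as below. This is an illustration, not the authors' implementation. It works in the log domain and up to an additive constant, and it assumes that each point is centered on the cluster mean c_j before its hidden variables are computed; that centering, the function names and the toy clusters are all assumptions.

```python
import numpy as np

def log_likelihood(X, B, c):
    """log f_ij up to a constant: -sum_r |h_ir| - log|det B|, for one cluster (B, c)."""
    H = (X - c) @ np.linalg.inv(B)          # hidden variables of each point (assumed centering)
    _, log_abs_det = np.linalg.slogdet(B)   # log |det(B)|, numerically stable
    return -np.abs(H).sum(axis=1) - log_abs_det

def assign(X, bases, centers):
    """Step 2.4: put every point into the cluster of maximum likelihood."""
    scores = np.column_stack(
        [log_likelihood(X, B, c) for B, c in zip(bases, centers)]
    )
    return scores.argmax(axis=1)

# Two hypothetical clusters with identity bases and well-separated means.
bases = [np.eye(2), np.eye(2)]
centers = [np.zeros(2), np.array([10.0, 10.0])]
X = np.array([[0.1, 0.0], [9.9, 10.2]])
labels = assign(X, bases, centers)
```

The Laplacian-style prior exp(-|h|) rewards points whose hidden variables are sparse under a cluster's bases, which is what separates clusters with different non-orthogonal directions.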



3 Experimental Results 

In this section, we show the experimental results of applying (a) AutoSplit to
real-world share price sequences and (b) Clustering-AutoSplit to video
frames. We also empirically examine the quality and scalability of AutoSplit.

3.1 Share Price Sequences 

The share price data set (DJIA) contains the weekly closing prices of the m=29 
companies in the Dow Jones Industrial Average, starting from the week of Jan- 
uary 2, 1990 to that of August 5, 2002, and gives data at n=660 time ticks per 
company. Closing prices collected at the same week/time-tick are grouped into 
a 29-D vector, i.e., a company is an attribute. The resulting data matrix X is 
660-by-29. 

Before applying AutoSplit, we preprocess the data matrix X to make the values
of each company/attribute zero-mean and unit-variance. To extract l=5 hidden
variables, the dimensionality is first reduced from 29 to 5 using whitening
(Section 2.1). Figure 6(b) shows the 2 most influential hidden variables
(h_1, h_2). We would like to understand what these hidden variables stand for.

[Figure 6 panels: (a) 5 members of DJIA; (b) Hidden variables of DJIA]

Fig. 6. (Share closing price (DJIA, 1990-2002)) (a) AA: Alcoa, AXP: American
Express, BA: Boeing, CAT: Caterpillar, C: CitiGroup. (b) (Top) Probably the
general trend of share prices. (Bottom) Probably the Internet bubble.

Table 1. Company contributions according to the hidden variables h_1, h_2. j
is the index to companies. (INTC: Intel, AXP: American Express, DIS: Disney,
MO: Philip Morris, PG: Procter and Gamble, DD: Du Pont)

0.900317   HWP   0.658768   DIS   0.490529   DD   0.133337

Table 1(a) lists the 5 companies with the largest and the 5 with the smallest
contributions b_1j to the hidden variable h_1 (top of Figure 6(b)), where
index j is the index to the different companies. As shown, all companies have
strong positive contributions (about 0.6 to 0.9) except AT&T (symbol: T). The
regularity of contributions among all companies suggests that h_1 represents
the general trend of the share price series.

On the other hand, the hidden variable h_2 (bottom of Figure 6(b)) is mostly
silent (near zero) except for a sharp rise and drop in the year 2000. This
seems to correspond to the “Internet bubble”. To verify this, we can check the
companies’ contributions b_2j to this hidden variable h_2. Table 1(b) lists
the companies having the 5 highest and 5 lowest contributions b_2j to h_2.
Companies having big contributions are mostly technical companies and service
(financial, entertainment) providers, while those with near-zero contributions
are bio-chemical and traditional industry companies. Since the technical
companies are more sensitive to h_2, we suggest that h_2 corresponds to the
“Internet bubble”, which largely affected technical companies during
2000-2001.

(Observation 2) AutoSplit automatically discovered the meaningful under- 
lying factors, namely, the general trend and the Internet bubble (Figure 6(b)). 

(Observation 3) We found rules like “companies in financial and traditional
industry grow steadily during 1990-2002, while technical companies suffered
from the event around late 2000”, and detected outliers like AT&T, which does
not follow the general growth trend during the period 1990-2002 (Table 1(a)).

[Figure 7 panels: (a) “Broad jumps”; (b) Time vs data set size; (c) Time vs dimensionality; (d) Time per iteration]

Fig. 7. (a): Bases found by Batch-AutoSplit (solid), PCA (dash), and AutoSplit
(dash-dot) on the “broad jumps” data set. (b)(c)(d): Scalability of AutoSplit.



3.2 AutoSplit: Quality and Scalability

Can we operate AutoSplit on a continuous data stream? If yes, at what accuracy
loss? Here we compare our “AutoSplit”, the online processing algorithm, with
Batch-AutoSplit. Figure 7(a) shows the bases generated by AutoSplit, along
with those by Batch-AutoSplit and PCA. Batch-AutoSplit has access to the
complete data, so it gives near perfect bases (solid vectors). AutoSplit gives
bases (dash-dot vectors) which are very close to the true bases.

We also studied the scalability of AutoSplit with respect to the data set
size and to the data dimensionality. To study the effect of data set size, we
generate a 2-D synthetic data set similar to our “broad jumps” data set, but with
more data points (vary from n=10^ to 10®). Figure 7(b) shows the total running 
time of AutoSplit (until convergence) versus the data set size. The total
running time of AutoSplit is linear (O(n)) in the data set size (n), as
expected, since the computational cost per iteration is constant (several
matrix multiplications). Figure 7(d) shows the small constant computational
cost per iteration (≈ 0.12 msec), when AutoSplit is applied to the “broad
jumps” data set.

To study the effect of dimensionality, we fixed the data set size to n=35,000,
while varying the dimensionality from m=10 to 70. Synthetic data of higher
dimensionality are generated to have a non-orthogonal distribution similar to
a high-dimensional version of the “broad jumps” data set with more “spikes”.
The total running time (Figure 7(c)) of AutoSplit is super-linear, and
probably quadratic (O(m^2)).
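
One common way to check claims such as "linear in n" or "probably quadratic in m" is to fit a straight line to the timings on a log-log scale and read off the slope. The sketch below shows that check on made-up timings; the numbers and the function name `scaling_exponent` are hypothetical and are not measurements from the paper.

```python
import numpy as np

def scaling_exponent(sizes, times):
    """Slope of log(time) against log(size); about 1 for O(n), about 2 for O(m^2)."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times), 1)
    return float(slope)

# Hypothetical timings, generated from exact power laws for illustration only.
m = np.array([10.0, 20.0, 40.0, 70.0])
t_quadratic = 3e-4 * m ** 2
n = np.array([1e3, 1e4, 1e5])
t_linear = 0.12e-3 * n

exp_m = scaling_exponent(m, t_quadratic)
exp_n = scaling_exponent(n, t_linear)
```

With real measurements the slope would only be approximate, which matches the hedged wording "probably quadratic".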

3.3 Experiment with Clustering-AutoSplit

We apply Clustering-AutoSplit to separate two texture classes in an image:
overlay text and background. The idea is to find two different sets of bases
for image patches of the two classes. Figure 8(a1)(a2) shows that
Clustering-AutoSplit (with k=2) gives good separation of the overlay text from
the background in a video frame [23]. The data items (each a 36-dimensional
row vector in X) are the 6-by-6 pixel blocks taken from the frame. Figure
8(b1)(b2) shows the failure of the PCA-based mixture model (MPPCA, Mixture of
Probabilistic PCA [24]) on this task. MPPCA fails to differentiate the
background edges from the real text.
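
The data preparation described above, turning a frame into 36-dimensional row vectors, can be sketched with NumPy reshaping. The paper does not say whether the 6-by-6 blocks overlap; this sketch assumes non-overlapping blocks and a grayscale frame, and the function name `image_to_blocks` is hypothetical.

```python
import numpy as np

def image_to_blocks(img, b=6):
    """Cut a 2-D image into non-overlapping b-by-b blocks, one flattened block per row."""
    h, w = img.shape
    h, w = h - h % b, w - w % b                       # drop incomplete border blocks
    blocks = img[:h, :w].reshape(h // b, b, w // b, b).swapaxes(1, 2)
    return blocks.reshape(-1, b * b)

# Hypothetical 12-by-18 "frame" whose pixel values are just running indices.
frame = np.arange(12 * 18, dtype=float).reshape(12, 18)
X = image_to_blocks(frame)
```

Each row of `X` is one data item for the clustering step; blocks are ordered left to right, then top to bottom.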




[Figure 8 panels: (a1) Background; (a2) Overlay text; (b1) Background; (b2) Overlay text]

Fig. 8. Texture segmentation. Results from (a) Clustering-AutoSplit and (b) MPPCA.



4 Conclusions 

We propose AutoSplit, a powerful, incremental method for processing streams
as well as static, multimedia data. The proposed “Clustering-AutoSplit”
extends the feature discovery to multiple feature/bases sets and performs
better than the PCA-based method in texture segmentation (Figure 8). The
strong points of AutoSplit are:

– It finds bases which better capture the natural trends and correlations of
  the data set (Figure 1(a)).

– It finds rules, which are revealed in the basis matrix B (Observations 1
  and 3).

– It scales linearly with the number n of data points (Figure 7(b)).

– Its incremental, single-pass algorithm (Figure 3) makes it readily suitable
  for processing streams.

Abstract. String matching and the global word frequency model are two basic
models of Document Copy Detection, although both are unsatisfactory in some
respects. The String Kernel (SK) and Word Sequence Kernel (WSK) map string
pairs directly into a new feature space in which the data is linearly
separable. This idea inspired the Semantic Sequence Kin (SSK), which we apply
to document copy detection. SK and WSK only take into account the gap between
the first word/term and the last word/term, so they are not well suited to
plagiarism detection. SSK considers the position information of each common
word so as to detect plagiarism at a fine granularity. SSK is based on
semantic density, which is in fact local word frequency information. We
believe these measures greatly diminish the noise of rewording. We test SSK on
a small corpus with several common copy types. The results show that SSK is
excellent for detecting non-rewording plagiarism and remains valid even if
documents are reworded to some extent.



1 Introduction 

In this paper we propose a novel Semantic Sequence Kin (SSK) that is based on
local semantic density rather than on the common global word frequency, and we
apply it to Document Copy Detection (DCD) rather than Text Classification
(TC). DCD aims to detect whether some part or the whole of a given document is
a copy of other documents (i.e., plagiarism). The word frequency based kernel
is popular in TC, but it is not suitable for DCD. The word frequency model
captures mainly global semantic features of a document but loses the detailed
local features and structural information. For example, the TF-IDF (Term
Frequency - Inverse Document Frequency) vector is a basic document
representation in TC. But we cannot use TF-IDF vectors to distinguish two
sentences (or sections) that are just different arrangements of the same
words, which usually have different meanings.
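
The limitation described above is easy to reproduce. The sketch below (illustrative only; the two sentences and the helper name `term_frequencies` are made up) shows two sentences with different meanings that have identical term-frequency vectors. Since IDF weights depend only on the term and not on its position, TF-IDF vectors for these sentences would coincide as well.

```python
from collections import Counter

def term_frequencies(text):
    """Bag-of-words representation: word -> number of occurrences."""
    return Counter(text.lower().split())

s1 = "the dog bites the man"
s2 = "the man bites the dog"

same_bag = term_frequencies(s1) == term_frequencies(s2)   # word order is ignored
same_order = s1.split() == s2.split()                     # the sequences differ
```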

By means of string matching, we can find the plagiarized sentences exactly.
Indeed, many DCD prototypes [4-6] prefer it. This method first extracts some
strings, called fingerprints, as text features and then matches the
fingerprints to detect plagiarism. The string matching model exploits mainly
local features of a document. It can hardly resist noise, and rewording
sentences may heavily impair the detection precision. Therefore, it is better
to take account of both global and local features in order to detect
plagiarism in certain detail and to resist rewording noise.

In SSK, we first find the semantic sequences based on the concept of semantic
density, which represents the locally frequent semantic features, and then we
collect all of the semantic sequences to capture the global features of the
document. When we calculate the similarity between document features, we adopt
the ideas of the word sequence kernel and the string kernel.

In the next section, we introduce some related work on string kernels and
DCD. We present our Semantic Sequence Kin in detail in Section 3 and report
experimental results in Section 4. We discuss some aspects of SSK in Section
5. Finally we draw conclusions in Section 6.



2 Related Work 

Joachims [1] first applied SVM to TC. He used the VSM (Vector Space Model) to
construct the text feature vector, which contains only global word frequency
information without any structural (sequence) information. Lodhi et al. [3]
proposed the string kernel method, which classifies documents by the common
subsequences between them. The string kernel exploits structural information
(i.e. gaps between terms) instead of word frequency. Before long, Cancedda et
al. [2] introduced the word sequence kernel, which extends the idea of the
string kernel. They greatly expand the number of symbols to consider, as
symbols are words rather than characters.

Kernel methods are now popular in TC, but we have not found them applied to
DCD. Brin et al. [4] proposed the first DCD prototype (COPS), which detects
overlap based on sentence and string matching, but it has some difficulties in
detecting sentences and cannot find partial sentence copies. In order to
improve COPS, Shivakumar and Garcia-Molina [7] developed SCAM (Stanford Copy
Analysis Method), which measures overlap based on word frequency.

Heintze [6] developed the KOALA system for plagiarism detection. Broder et
al. [5] proposed a shingling method to determine the syntactic similarity of
files. These two systems are similar to COPS. Monostori et al. [8] proposed
the MDR (Match Detect Reveal) prototype to detect plagiarism in large
collections of electronic texts. It is also based on string matching, but it
uses a suffix tree to find and store strings.

Si et al. [9] built a copy detection mechanism, CHECK, that parses each
document to build an internal indexing structure, called the structural
characteristic (SC), which is used in the document registration and comparison
modules. Song et al. [10] presented an algorithm (CDSDG) to detect illegal
copying and distribution of digital goods, which in effect combines CHECK and
SCAM to discover plagiarism.



3 Semantic Sequence Kin 

In the following we first introduce some concepts about the semantic sequence, and 
then we propose SSK in detail. 

3.1 Semantic Density and Semantic Sequence

Definition 1 Let S be a sequence of words, i.e. S = S_1 S_2 ... S_n. We denote
the word at position i in S by S_i. The word distance of position i
(1 ≤ i ≤ n), denoted by σ(i), is the number of words between S_i and its
preceding occurrence S_h:

$$\sigma(i) = i - h \tag{1}$$

where S_h = S_i and S_k ≠ S_i for every k with 1 ≤ h < k < i ≤ n, that is,
S_h and S_i are the same word and no other word between S_h and S_i is the
same as S_i. If no S_h exists, i.e. S_i occurs for the first time, then
σ(i) = ∞.

Definition 2 Let S be a sequence of words, i.e. S = S_1 S_2 ... S_n. The
semantic density of position i (1 ≤ i ≤ n), denoted by ρ(i), is the reciprocal
of σ(i):

$$\rho(i) = 1/\sigma(i) \tag{2}$$

The semantic density is a kind of word density, but we believe that a sequence
of words should imply certain semantic information. In fact, σ(i) is the
distance of S_i to its preceding occurrence in the sequence S, and ρ(i)
reflects its local frequency. A document is a long sequence of words, so
within a given range a small distance means a high density of words in a local
section. We believe that the high-density words in a section indicate the
local semantic features of that section.
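
Definitions 1 and 2 translate directly into a single pass over the word sequence. The following sketch is an illustration rather than the authors' code; it uses 0-based positions, and the function names and the example sentence are hypothetical.

```python
import math

def word_distances(words):
    """sigma(i): distance from position i back to the previous occurrence of the same word."""
    last_seen = {}
    sigma = []
    for i, w in enumerate(words):
        # A first occurrence has no preceding occurrence, so its distance is infinite.
        sigma.append(i - last_seen[w] if w in last_seen else math.inf)
        last_seen[w] = i
    return sigma

def semantic_density(words):
    """rho(i) = 1 / sigma(i); first occurrences get density 0."""
    return [1.0 / s for s in word_distances(words)]

words = "a b a a c b".split()
sigma = word_distances(words)
rho = semantic_density(words)
```

For this example, `sigma` is `[inf, inf, 2, 1, inf, 4]` and `rho` is `[0.0, 0.0, 0.5, 1.0, 0.0, 0.25]`: a word repeated immediately gets the highest density.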

Definition 3 Let S be a sequence of words, i.e. S = S_1 S_2 ... S_n. A
semantic sequence of S is a subsequence of S, denoted by
S[I] = S_{i_1} S_{i_2} ... S_{i_r} with I = [i_1, i_2, ..., i_r], which
satisfies the following conditions:

(1) $|S[I]| > 1$

(2) $0 < i_{k+1} - i_k \le \varepsilon, \quad 1 \le k < r$

(3) $\rho(i_k) \ge \delta, \quad 1 \le k \le r$

(4) $(0 < i_1 - j \le \varepsilon) \Rightarrow (\rho(j) < \delta)$

(5) $(0 < j - i_r \le \varepsilon) \Rightarrow (\rho(j) < \delta)$

where δ and ε are user-defined parameters.

In fact, a semantic sequence in S is a continual word sequence after the
low-density words in S are omitted. In Definition 3, condition 1 guarantees
that S[I] is a non-trivial subsequence of S. Condition 2 ensures that we
sample the text adequately. Condition 3 ensures that each word in S[I] is
dense, i.e. locally frequent. Conditions 4 and 5 ensure that the subsequence
cannot be extended in either direction.
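
One way to read Definition 3 operationally is: keep the positions whose density reaches δ, then group neighbouring dense positions whose gap is at most ε, and discard groups with a single position. The sketch below implements that reading; it is an interpretation under stated assumptions (0-based positions, greedy left-to-right grouping, the name `semantic_sequences`), not the authors' published procedure.

```python
def semantic_sequences(words, delta, eps):
    """Return lists of positions forming maximal runs of dense words with gaps <= eps."""
    last_seen, dense = {}, []
    for i, w in enumerate(words):
        # Density is 1 / (distance to the previous occurrence); first occurrences are skipped.
        if w in last_seen and 1.0 / (i - last_seen[w]) >= delta:
            dense.append(i)
        last_seen[w] = i

    groups, current = [], []
    for i in dense:
        if current and i - current[-1] > eps:   # gap too large: close the current run
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]    # keep only non-trivial runs

words = ["a", "b", "a", "b", "c", "d", "e", "f", "a", "a"]
result = semantic_sequences(words, delta=0.5, eps=2)
```

Here positions 2, 3 and 9 are dense; 2 and 3 form one run, while 9 is too far away and is dropped as a trivial run, so `result` is `[[2, 3]]`.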

A long S may have several semantic sequences. We denote all of the semantic
sequences in a document S by Ω(S), which then includes the global and local
semantic features as well as local structural information. However, a single
semantic sequence may not represent the global features of the document. For
example, suppose we discuss a question q in paragraph P of a document, and q
is never mentioned in any paragraph other than P. If the proportion of P in
the whole document is very low, then words about q may seldom appear in the
global features. But they can be caught in the semantic sequences of P. Hence
we believe the semantic sequences can detect plagiarism at a fine granularity,
so that we can find n to 1 partial copies well (we explain the n to 1 partial
copy in Section 4.1).

For DCD, if one section is shared between two documents, then we consider
them a copy pair. It is enough for DCD to compare similarity in the several
most likely common sequences. We know that the larger the number of common
words between two strings, the more similar they are. Therefore, we select
candidate semantic sequences in Ω(S) and Ω(T) according to the number of
common words between them.

Let S[I] ∩ T[J] be the set of common words between S[I] and T[J]. Let
CL(S,T) = [(S[I_1],T[J_1]), ..., (S[I_n],T[J_n])] be the list of semantic
sequence pairs on documents S and T, sorted by |S[I_k] ∩ T[J_k]| (1 ≤ k ≤ n)
in descending order. We denote the first η semantic sequence pairs by
CL_η(S,T). We denote the set of all the common words between semantic sequence
pairs in CL_η(S,T) by CP(S,T):

$$CP(S,T) = \bigcup_{k=1}^{\eta} \left( S[I_k] \cap T[J_k] \right), \quad (S[I_k], T[J_k]) \in CL_\eta(S,T) \tag{3}$$

The words in CL_η(S,T) are not only common words between S and T but also
locally frequent words. We believe that CL_η(S,T) reflects the local identity
between 2 documents in some detail. Since the more common words there are
between documents, the more similar they are, |CP(S,T)| can measure document
similarity. However, when two documents share the same word list, they must be
very similar but may not be identical. Different arrangements of the same word
list often represent different documents on the same topic, namely, they are
similar and belong to the same category, but are not identical. |CP(S,T)|
regards all possible arrangements of one word list as the same, so it yields a
high positive error. In order to discriminate between different arrangements
of the same word list, we add structure information of strings in SSK.
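
The candidate selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: semantic sequences are represented as plain word lists, every pair from the two documents with at least one common word is treated as a candidate (an assumption), and the names `candidate_pairs` and `common_words` are hypothetical.

```python
from itertools import product

def candidate_pairs(omega_s, omega_t, eta):
    """First eta sequence pairs, ordered by the number of common words (descending)."""
    pairs = [(s, t) for s, t in product(omega_s, omega_t) if set(s) & set(t)]
    pairs.sort(key=lambda p: len(set(p[0]) & set(p[1])), reverse=True)
    return pairs[:eta]

def common_words(pairs):
    """Union of the common words of all selected pairs, as in equation (3)."""
    cp = set()
    for s, t in pairs:
        cp |= set(s) & set(t)
    return cp

# Hypothetical semantic sequences of two documents.
omega_s = [["a", "b", "c"], ["x", "y"]]
omega_t = [["c", "b", "a"], ["y", "q"]]
cp_1 = common_words(candidate_pairs(omega_s, omega_t, eta=1))
cp_2 = common_words(candidate_pairs(omega_s, omega_t, eta=2))
```

Note that the first pair shares all three words even though their order is reversed, which is exactly the case that |CP(S,T)| alone cannot distinguish.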



3.2 Semantic Sequence Kin 



The string kernel and word sequence kernel calculate the dot product of two
strings based on gaps between terms/words, i.e. l(i). The l(i) cannot exactly
reflect different arrangements of the same word list because it considers only
the first and the last term/word in the list, not the others. If we take into
account the position of each word, then we can detect plagiarism more
precisely. This is the basic idea of SSK.

The Semantic Sequence Kin of two semantic sequences S[I] and T[J] is defined as:

(4) 



W l^mi + IUy] 

where x_k is the difference of a common word's word distances between S[I] and
T[J], that is:

$$x_k = |\sigma_S(w_k) - \sigma_T(w_k)|, \quad w_k \in S[I] \cap T[J] \tag{5}$$

where σ_S(w_k) is the word distance of w_k in S and σ_T(w_k) is the word
distance of w_k in T. If a common word occurs twice or more in S[I] (or T[J]),
then in S[I] ∩ T[J] we only keep the word that occurs first, with its position
in S[I] (or T[J]), and omit the other occurrences. For example, there are six
semantic sequences as follows:

S_1: ....ABC ; S_2: ....ABC ABC ; S_3: ....ABC BCA ;
T_1: ....C B A ; T_2: ....C B A C B A ; T_3: ....C B A A B C ;

Then K(S_1,T_1)=K(S_2,T_2)=K(S_3,T_3). This may disturb SSK slightly. However,
we ignore it and treat it as noise.

In order to keep the kernel values comparable and independent of the length of
the strings, we normalize the kernel as follows:

$$\hat{K}(S[I],T[J]) = \frac{K(S[I],T[J])}{\sqrt{K(S[I],S[I])\,K(T[J],T[J])}} \tag{6}$$

Indeed, for every w_k ∈ S[I] ∩ S[I] = S[I] we have x_k = 0, so

$$K(S[I],S[I]) = \sum_{k=1}^{|S[I] \cap S[I]|} 1 = |S[I]|,$$

and similarly K(T[J],T[J]) = |T[J]|. Therefore

$$\hat{K}(S[I],T[J]) = \frac{K(S[I],T[J])}{\sqrt{|S[I]|\,|T[J]|}} \tag{7}$$

We give the Semantic Sequence Kin of a single semantic sequence pair above.
For a document pair S and T, we may select several semantic sequence pairs in
order to improve accuracy, i.e. CL_η(S,T), η ≥ 1. Thus, we define the Semantic
Sequence Kin of the document pair (S,T) as:

$$K(S,T) = \frac{1}{\eta} \sum_{k=1}^{\eta} \hat{K}(S[I_k],T[J_k]), \quad (S[I_k],T[J_k]) \in CL_\eta(S,T) \tag{8}$$

It is obvious that K(S,T) ∈ [0,1]. For a document pair (S,T), we use the
discriminant below to make the decision.

$$f(S,T) = \mathrm{sign}(aK(S,T) + b), \quad a,b \in R,\ a \ne 0 \tag{9}$$


4 Experimental Results 

In this section, we report experiments that test our Semantic Sequence Kin
and compare it with the Relative Frequency Model (RFM) [7], which was
successfully used to track real-life instances of plagiarism in several
conference papers and journals [12]. RFM uses an asymmetric metric to find
subset copies, but it is less effective for n to 1 partial copies. In order to
contrast the error trend of SSK with that of RFM, we vary the discriminating
threshold manually, that is:

$$f(S,T) = \mathrm{sign}(aK(S,T) + b) = \mathrm{sign}(a)\,\mathrm{sign}(K(S,T) + b/a), \quad a,b \in R,\ a \ne 0$$

If we let a > 0, then f(S,T) = sign(K(S,T) + b/a). Let τ = b/a, called the
discriminating threshold. When we vary the value of τ, we get different
detection errors.
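
The document-level score of equation (8) and the discriminant of equation (9) can be sketched in a few lines. This is an illustration only: the normalized pair scores fed to `document_kernel` are made-up numbers, and the function names are hypothetical. The last lines check numerically the factorization sign(aK + b) = sign(a) sign(K + b/a) used above.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def document_kernel(pair_scores):
    """Equation (8): average of the normalized kernel values of the selected pairs."""
    return sum(pair_scores) / len(pair_scores)

def decide(k, a, b):
    """Equation (9): f(S,T) = sign(a * K(S,T) + b)."""
    return sign(a * k + b)

k_doc = document_kernel([0.9, 0.7])          # hypothetical pair scores, eta = 2
flagged = decide(k_doc, a=1.0, b=-0.5)       # score above the cut-off
cleared = decide(0.3, a=1.0, b=-0.5)         # score below the cut-off

# Factorization check for a positive and a negative a.
factorization_holds = all(
    decide(0.8, a, 1.0) == sign(a) * sign(0.8 + 1.0 / a) for a in (-2.0, 2.0)
)
```

Because only the ratio b/a matters once the sign of a is fixed, sweeping a single threshold is enough to trace out the error curves.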



4.1 Test Corpus 

We collected 75 full-text papers in PDF format from the Internet; they can be
downloaded from ftp://202.117.15.227/dcd_corpora.zip. These are the original documents
from which we make our plagiarized documents. (1) Exact self-copy (D-copy): these
copies are just renamed source files. (2) Self cross copy (O-copy): we first divide each
source file into 10 blocks and then randomly reassemble these blocks into a new
copy. (3) Subset copy (E-copy): we select 2 files from the source files, divide each
of them into 10 blocks, and randomly reassemble these 20 blocks into a new
copy. The 2 source files are therefore subsets of the copy. (4) N to 1 partial
copy (X-copy): we select n files (n = 5 in our experiment) from the source files
and divide each of them into 10 blocks; we then randomly select k blocks (k = 2 in our
experiment) from each set of 10 blocks, and randomly reassemble all selected
blocks into a new copy. The copy contains small parts of several source files, but none
of them is a subset of the copy. (5) The 4 types above are non-rewording copy types.
We reword each word of the non-rewording copy files with a certain probability to get the
corresponding rewording files. We use Martin Porter's stemming algorithm and procedure
[11] to stem words and then remove the stop words¹. Finally, we make a document
pair from a copy file and a source file.
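The block-based copy types can be sketched as follows. This is a minimal Python illustration under stated assumptions: documents are represented as lists of words, blocks are contiguous and of near-equal size, and all function names are hypothetical. The original experiments may have split and reassembled the files differently.

```python
import random


def split_blocks(words, n_blocks=10):
    """Split a word list into n_blocks contiguous blocks of near-equal size."""
    size, extra = divmod(len(words), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + size + (1 if b < extra else 0)
        blocks.append(words[start:end])
        start = end
    return blocks


def reassemble(blocks, rng):
    """Shuffle the blocks and concatenate them into one document."""
    blocks = list(blocks)
    rng.shuffle(blocks)
    return [w for block in blocks for w in block]


def self_cross_copy(source, rng):
    """Type (2): shuffle the 10 blocks of a single source file."""
    return reassemble(split_blocks(source), rng)


def subset_copy(source_a, source_b, rng):
    """Type (3): shuffle all 20 blocks of two source files."""
    return reassemble(split_blocks(source_a) + split_blocks(source_b), rng)


def partial_copy(sources, k, rng):
    """Type (4): take k random blocks from each source, then shuffle them."""
    chosen = []
    for source in sources:
        chosen.extend(rng.sample(split_blocks(source), k))
    return reassemble(chosen, rng)


# Synthetic stand-in documents of 100 words each (not the real corpus).
rng = random.Random(0)
docs = [[f"d{d}w{i}" for i in range(100)] for d in range(5)]
x_copy = partial_copy(docs, k=2, rng=rng)
```

With n = 5 sources and k = 2 blocks of 10 words each, the partial copy contains 100 words, 20 from every source, so no source is a subset of the copy.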

We define the positive error as the proportion of non-copy document pairs (no
plagiarism) at or above the threshold τ among all non-copy document pairs, i.e.

\[ E^{+}_{\Phi}(\tau) = \frac{|\{\Phi_N\}_{\tau}|}{|\Phi_N|} \qquad (10) \]

where Φ is some type of document pairs, Φ_N is the set of non-copy pairs of type Φ,
and {Φ_N}_τ is the set of those non-copy pairs of type Φ whose plagiarism score is
greater than or equal to τ. The negative error is the proportion of copy document
pairs (containing plagiarism) below the threshold τ among all copy pairs, i.e.

\[ E^{-}_{\Phi}(\tau) = \frac{|\{\Phi_C\}_{\tau}|}{|\Phi_C|} \qquad (11) \]

where Φ_C is the set of copy pairs of type Φ and {Φ_C}_τ is the set of those copy
pairs of type Φ whose score is less than τ.
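The two error measures follow directly from their verbal definitions. The Python sketch below assumes that plagiarism scores are given as plain lists of floats; the function names and the toy scores are illustrative only and are not experimental data.

```python
def positive_error(non_copy_scores, tau):
    """Share of non-copy pairs whose score is >= tau (false alarms)."""
    if not non_copy_scores:
        return 0.0
    return sum(s >= tau for s in non_copy_scores) / len(non_copy_scores)


def negative_error(copy_scores, tau):
    """Share of copy pairs whose score is < tau (missed plagiarism)."""
    if not copy_scores:
        return 0.0
    return sum(s < tau for s in copy_scores) / len(copy_scores)


# Made-up scores, used only to show how the errors move with the threshold.
non_copy = [0.05, 0.10, 0.20, 0.60]
copy = [0.30, 0.50, 0.70, 0.90]
for tau in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(tau, positive_error(non_copy, tau), negative_error(copy, tau))
```

Sweeping τ upward can only lower the positive error and raise the negative error, which is the trade-off examined in the experiments.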



4.2 Contrasting Experiments 

We have mentioned that |CP(S,T)| can be used to detect plagiarism; we call
this approach the Semantic Sequence Model (SSM). The SSM plagiarism score of
documents S and T is P_SSM(S,T):

\[ P_{SSM}(S,T) = \min\{\,\ldots,\ 1\,\} \qquad (12) \]

Figure 1 shows the error trends of SSM, RFM and traditional VSM on the whole
non-rewording corpus. The negative error of SSM is very low and flat, but
its positive error is high. Figure 2 shows their error trends on the X-copy corpus. The
negative errors of RFM and VSM increase rapidly, while that of SSM remains small. This
illustrates that RFM and VSM are ineffective on the X-copy corpus whereas SSM is valid.
SSM gains its low negative error at the expense of a high positive error.

Figure 3 shows the error trends of Semantic Sequence Kin (SSK), SSM and RFM
on the whole non-rewording corpus. The positive errors of SSK are always
near 0, while its negative errors increase slowly. The negative errors of SSK are slightly
larger than those of SSM: SSK gains a very low positive error at the expense of a small
increase in negative error. In any case, SSK is superior to RFM.

Fig. 1. VSM, RFM and SSM error on the whole non-rewording corpus

Fig. 2. VSM, RFM and SSM error on the X-copy corpus

¹ Available at http://ls6-www.informatik.uni-dortmund.de/ir/projects/freeWAIS-sf/

Figure 4 shows the error trends of SSM and RFM on the rewording corpus with
a per-word rewording probability² θ = 0.7. From Figure 4, we see that the negative error
of RFM on the rewording corpus is far larger than that of SSM. This implies that RFM
misses more cases of plagiarism than SSM. Figure 5 shows the error trends of SSK
with the same rewording probability. Figure 6 shows SSK error trends with different
rewording probabilities (θ = 0.2, 0.4, 0.6, 0.8). As the rewording probability increases, the
positive errors of SSK remain stable and the negative errors of SSK increase slightly.
In any case, we can choose an appropriate value of τ to keep both the positive and
negative errors in a tolerable range. This shows that SSK is valid for DCD even if the
document is reworded to some degree.



5 Discussions 

From the experiments we find that the positive error and the negative error pull
against each other: a low positive error leads to a high negative error
and vice versa. Therefore we have to make a trade-off between the positive and
negative errors. The negative error of SSM is the lowest, which comes with the highest
positive error. We use SSK to make the positive error very low at the expense of
a slightly higher negative error. In any case, SSM and SSK are superior to RFM and
traditional VSM.

Fig. 3. SSK, RFM and SSM errors on the whole non-rewording corpus

Fig. 4. RFM and SSM errors on the rewording corpus with rewording probability θ = 0.7

Fig. 5. SSK errors on the rewording corpus with rewording probability θ = 0.7

Fig. 6. SSK errors on the rewording corpus with θ = 0.2, 0.4, 0.6, 0.8

² We use JWNL (available at http://sourceforge.net/projects/jwordnet) to reword each word in a
document. Because the synonym set of a word contains the word itself, the real rewording
probability is lower than the labeled value.

Both SSK and SSM conform to the principle that the more common words there
are between documents, the more similar they are. Additionally, SSK satisfies a
stronger structural condition: the word distances of the common words must be similar,
otherwise the word is penalized. On the one hand, SSM is superior to SSK on
negative error because it misses fewer plagiarisms than SSK, although many
documents are mistakenly flagged as plagiarism. On the other hand, a high
SSK score means that one or more word sequences must be almost the same
in both documents, which is just what plagiarism means. Thus, SSK seldom
mistakes non-plagiarism for plagiarism, so its positive errors are very low.

However, rewording may not only reduce the number of common words but also
disturb the word order, which raises the negative error of SSK and SSM, so
they may miss some plagiarized documents. Interestingly, rewording also
decreases the probability of mistaking documents for plagiarism, so that the
positive error declines as the rewording probability rises.

Moreover, as the discriminating threshold τ increases, the positive
error declines and the negative error increases. A higher value of τ
means that we need more common words to confirm plagiarism, so that we
make fewer false accusations. Consequently, when the value of τ is low,
we get a low negative error and a high positive error, and when the value of τ is high,
we get a high negative error and a low positive error; this is why the optimal value
of τ for SSM is larger than that for SSK. Altogether, a low τ makes the system radical:
it misses fewer plagiarism cases but easily misjudges similar documents
as plagiarism. In contrast, a high τ makes the system conservative: it catches
serious plagiarism but may fail to detect many light or partial copy documents.



6 Conclusions 

We extend the ideas of SK and WSK to obtain SSK. SSK compares two semantic
sequences according to their common words and position information, so that we can
detect plagiarism at a fine granularity. SSK combines the features of the word frequency
model and the string matching model, which are commonly used in traditional DCD. As a
result, SSK is superior to traditional DCD models such as VSM and RFM. In particular,
on the non-rewording corpus, both its positive errors and its negative errors are lower
than those of traditional models. For a given model, if we decrease the positive error by some
means, the negative error must increase, and vice versa. In order to get the
optimal result, we have to make a trade-off between positive error and negative error.
Rewording may not only change the possible common words between
documents, but also affect the local structural information, which greatly impairs
SSK's performance. Hence our next goal is to increase the detection accuracy
on the rewording corpus by incorporating a rewording probability into SSK.

Abstract. Scientific research reports usually contain a list of citations of previous
related work. An automatic citation tool is therefore an essential component
of a digital library of scientific literature. Due to variations in format, it
is difficult to automatically transform semi-structured citation data into structured
citations. Some digital library projects, such as ResearchIndex (CiteSeer) and
OpCit, have attempted automatic citation parsing. In order to recognize the metadata
of a citation string (e.g., authors, title, journal), we present a new
methodology based on a protein sequence alignment tool. We also develop a
template generating system that transforms known semi-structured citation strings
into protein sequences. These protein sequences are then saved as templates in
a database. A new semi-structured citation string is likewise translated into a
protein sequence. We then use BLAST (Basic Local Alignment Search Tool), a
sequence alignment tool, to find the template in the previously constructed template
database that is most similar to the new protein sequence. We then
parse the metadata of the citation string according to the template. In our experiment,
2,500 templates are generated by our template generating system. By
parsing all of these 2,500 citations using our parsing system, we obtain an 89%
precision rate. Using the same template database to train ParaTools yields a
79% precision rate. Note that the original ParaTools using its default
template database, which contains about 400 templates, obtains only a 30%
precision rate.



1 Introduction 

It is difficult for a computer to parse citations automatically because there are many
different citation formats. Citations always include author, title, and publication
information. The format of the publication information varies with publication type, e.g.,
books, journals, conference papers, research reports, and technical reports. Publication
information can include publication name, volume, number, page number, year
published, month published, and publisher's address. Citations can be presented in
either structured or semi-structured form. The semi-structured form is more flexible,
so bibliographies created by different people may have different citation forms.
The order of the metadata may differ, as may their attributes. Bibliographies on the
Internet are usually in semi-structured form. To use their data, we must first
transform the semi-structured bibliography into a structured bibliography. We have to
analyze the metadata of each citation, and build an index for bibliography searches
and citation statistics. In this paper, we discuss how to transform semi-structured
bibliographies into uniform structured data, the core problem of citation data processing.

CiteSeer [1][2][3][4], which uses heuristics to extract certain subfields, can “find
the titles and authors in citations roughly 80% of the time and page numbers roughly
40% of the time” [1]. Another system, ParaTools [5][6][7][8] (short for ParaCite
Toolkit), is a collection of Perl modules used for reference parsing. It uses a
template-based reference parser to extract metadata from references. Our approach is similar to
ParaTools in that we also use a template-based reference parser, but we obtain better
alignments by using BLAST [9][10].

There are about 3 billion nucleotides in a human genome, and roughly one
nucleotide in every 1,000 base pairs differs between genomes. BLAST can be used
to compare sequences in order to identify whether they belong to the same
person. We realized that the same method could be used to identify citations. In our system,
we use a form translation program to translate citations into a form that we can process
more easily. The form we use is a protein sequence because we can use BLAST, a
well-developed protein sequence matching program, to process it. BLAST needs a
scoring table to search for the most similar sequences in a protein sequence database.
This database stores the templates of known citations that have been translated into
protein form. After finding the most similar sequence template, we use a pattern
extraction program to parse the citation metadata according to the template. Once the
metadata are correctly parsed, we manually validate them and add them to our
knowledge database.

The rest of the paper is organized as follows. Section 2 presents the system architecture
and design: Section 2.1 introduces BLAST, Section 2.2 defines the tokens that we
use, Section 2.3 explains how the tokens are translated into amino acids, and Sections 2.4
and 2.5 describe the template generating system and the citation parsing system. Section 3
reports our experiments. Section 4 gives conclusions and future work.



2 System Architecture and Design 

As shown in Figure 1, our system consists of two subsystems: a template generating
system and a citation parsing system. For the template generating system, we
manually crawl BibTeX files on the web to obtain our template generating data. From
each BibTeX file, our system extracts the titles and then uses these titles to automatically
obtain semi-structured citations from the CiteSeer web site. We now have both
semi-structured data from CiteSeer and structured data from BibTeX files. Since the
semi-structured and structured data describe the same citations in different forms, the
system can use the two forms to build a template database automatically. After the
template database is constructed, the system begins the parsing process. In the parsing
system, BLAST is used to compare strings. We transform a citation into protein form
and use BLAST to search for the most similar sequence in the template database. We
can then parse the citation according to the template that BLAST finds.




Fig. 1. The architecture of this system. 



2.1 BLAST 

BLAST is a similarity search tool developed by Altschul et al. (1990). It is a heuristic
that approximates dynamic-programming local alignment, and it is used to search for
optimal local alignments between sequences. BLAST breaks the query and database
sequences into fragments (words). It searches for matches of words of length W that
score at least T. Matches are extended to generate an MSP (maximal segment pair) with
a score exceeding the threshold S. The quality of each pairwise alignment is represented
as a score. Scoring matrices are used to calculate the score of the alignment amino acid
by amino acid. The significance of each alignment is computed as a P-value or an E-value [10].
The P-value is the likelihood that two random sequences will have an MSP with a score
greater than or equal to S. The E-value is the expected number of MSPs to be found with
score S or higher in two random sequences.

2.2 Tokens 

Before explaining the architectures of the template generating and citation parsing
systems, we have to identify the tokens they use. We use regular expressions to define
the tokens precisely. Tokens are classified into three types: numerical, general,
and punctuation, as follows:

• Numerical tokens: [0-9]+

  We further classify a numerical token as a year number or a general number.
  The regular expressions are as follows:

  - Year: [12][0-9][0-9][0-9]
  - General number: Otherwise

• General tokens: [a-zA-Z]+

  We further classify a general token as one of the key words that often appear in
  citations (page, number, volume, month), as a name, or as unknown. The regular
  expressions are as follows:

  - Number:
      [Nn][Oo]
    | [Nn][Nn]
    | [Nn][Uu][Mm][Bb][Ee][Rr]
  - Page:
      [Pp][Pp]
    | [Pp][Aa][Gg][Ee]([Ss])?
  - Volume:
      [Vv][Oo]
    | [Vv][Oo][Ll]([Uu][Mm][Ee])?
  - Month:
      [Jj][Aa][Nn]([Uu][Aa][Rr][Yy])?
    | [Ff][Ee][Bb]([Rr][Uu][Aa][Rr][Yy])?
    | [Mm][Aa][Rr]([Cc][Hh])?
    | [Aa][Pp][Rr]([Ii][Ll])?
    | [Mm][Aa][Yy]
    | [Jj][Uu][Nn]([Ee])?
    | [Jj][Uu][Ll]([Yy])?
    | [Aa][Uu][Gg]([Uu][Ss][Tt])?
    | [Ss][Ee][Pp][Tt]([Ee][Mm][Bb][Ee][Rr])?
    | [Oo][Cc][Tt]([Oo][Bb][Ee][Rr])?
    | [Nn][Oo][Vv]([Ee][Mm][Bb][Ee][Rr])?
    | [Dd][Ee][Cc]([Ee][Mm][Bb][Ee][Rr])?
  - Name: We have a database that stores about 2,000 name tokens. If the
    token appears in the name database, it is identified as a name token.
  - Unknown: If the general token is not classified above, it is classified
    as unknown.

• Punctuation tokens: [\"\'.\(),:;\-!\?]
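The token classes above can be sketched with Python's standard re module. This is a hedged illustration, not the authors' implementation: the case-insensitive patterns restate the lists above (with October spelled in full), the names tokenize and classify are hypothetical, and the name database is replaced by a small set passed in by the caller.

```python
import re

# One alternative per token type: numerical, general, punctuation.
TOKEN_RE = re.compile(r"[0-9]+|[a-zA-Z]+|[\"'.(),:;\-!?]")
YEAR_RE = re.compile(r"[12][0-9]{3}")

# Key-word classes, checked in this order before the name lookup.
KEYWORDS = [
    ("number", re.compile(r"no|nn|number", re.I)),
    ("page", re.compile(r"pp|pages?", re.I)),
    ("volume", re.compile(r"vo|vol(ume)?", re.I)),
    ("month", re.compile(
        r"jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?"
        r"|aug(ust)?|sept(ember)?|oct(ober)?|nov(ember)?|dec(ember)?", re.I)),
]


def classify(token, names=frozenset()):
    """Return the token class; `names` stands in for the name database."""
    if re.fullmatch(r"[0-9]+", token):
        return "year" if YEAR_RE.fullmatch(token) else "general_number"
    if re.fullmatch(r"[a-zA-Z]+", token):
        for label, pattern in KEYWORDS:
            if pattern.fullmatch(token):
                return label
        return "name" if token in names else "unknown"
    return "punctuation"


def tokenize(citation, names=frozenset()):
    """Split a citation string into (token, class) pairs."""
    return [(t, classify(t, names)) for t in TOKEN_RE.findall(citation)]


tokens = tokenize("Vol. 51, no. 11, Nov. 2002.")
```

Checking the key-word patterns before the name lookup means that a token such as "Jan" is classified as a month even when it also occurs in the name set.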



2.3 Protein Sequence Translation 

In the last section, we used regular expressions to describe the tokens. We now
translate each token into an amino acid according to the following
rules:

Y: Match to Year
N: Match to General number
S: Match to Number
P: Match to Page
V: Match to Volume
M: Match to Month
A: Match to Name
NULL: Match to Unknown
G: Match to Punctuation token “ or ‘
D: Match to Punctuation token . or ; or ?
I: Match to Punctuation token (
K: Match to Punctuation token )
R: Match to Punctuation token ,
Q: Match to Punctuation token : or - or !

Once the tokens are translated into amino acids, the concatenation of the amino acids
forms a protein sequence. We call this sequence a prototype protein sequence, and
use it to represent the citation. We can use this sequence to search for the most similar
template in the database. We also use it to construct the template database. The citation
in Figure 2(a) is transformed into the sequence in Figure 2(b) by translating the tokens
according to the rules described above.
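The translation rules amount to a table lookup. The Python sketch below assumes that the tokens have already been classified and are supplied as (text, class) pairs; the class labels and the function name to_prototype are hypothetical conveniences for this illustration. Unknown tokens map to the empty string, which plays the role of NULL.

```python
# Amino acid for each non-punctuation token class ("" plays the role of NULL).
CLASS_TO_AMINO = {
    "year": "Y", "general_number": "N", "number": "S", "page": "P",
    "volume": "V", "month": "M", "name": "A", "unknown": "",
}

# Amino acid for each punctuation mark (straight and curly quotes both map to G).
PUNCT_TO_AMINO = {
    '"': "G", "'": "G", "\u201c": "G", "\u2018": "G",
    ".": "D", ";": "D", "?": "D",
    "(": "I", ")": "K", ",": "R",
    ":": "Q", "-": "Q", "!": "Q",
}


def to_prototype(classified_tokens):
    """Concatenate the amino acids of a list of (text, class) pairs."""
    letters = []
    for text, label in classified_tokens:
        if label == "punctuation":
            letters.append(PUNCT_TO_AMINO.get(text, ""))
        else:
            letters.append(CLASS_TO_AMINO[label])
    return "".join(letters)


# The tail "Vol. 51, no. 11, Nov. 2002." of the citation in Figure 2(a).
tail = [
    ("Vol", "volume"), (".", "punctuation"), ("51", "general_number"),
    (",", "punctuation"), ("no", "number"), (".", "punctuation"),
    ("11", "general_number"), (",", "punctuation"), ("Nov", "month"),
    (".", "punctuation"), ("2002", "year"), (".", "punctuation"),
]
prototype_tail = to_prototype(tail)
```

For this tail the result is VDNRSDNRMDYD, which agrees with the last twelve letters of the prototype sequence given for Figure 2(b).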

2.4 Template Generating System 

In the template generating system, we construct a template database containing
templates that represent most citation formats. Because BibTeX files are widely used
in bibliographies, we retrieve BibTeX files from the Internet. The BibTeX format was
designed by Oren Patashnik and Leslie Lamport. It is field based, so we can parse the
data of each field easily. We use the title field to search for the citations in CiteSeer. We
get the metadata of the citations found on the web by parsing the corresponding BibTeX
files, and use them to construct the templates. We begin with the method described in
Section 2.3 to create the prototype protein sequence. Since we know the metadata of
each field, we can locate the data in the CiteSeer citation and then modify the sequence
by adding A (author), L (journal), and T (title) at the correct positions in the sequence.
We then change each N to its corresponding amino acid: an amino acid N may
become F (value of volume) or W (value of number). This is illustrated in Figure
2(c). The protein sequence is now complete, and the template is finally constructed.
We store this template in the template database.

2.5 Parsing System 

After transforming the citation into a prototype protein sequence, we search for the
most similar sequence in the template database by using BLAST. The parsing system
works like the reverse of the template generating system. After we find a template,
we use it to extend the prototype protein sequence into a complete protein sequence.
Now the metadata can be parsed. To parse the citation shown in Figure
2(a), we first transform it into the prototype protein sequence of Figure 2(b). By
entering this sequence into BLAST, we find the template
AAAAAAAAARGTTTTTTTTTGRLLLLLLLRVDFRSDWRMDYD. In this template,
the author field comes first. The author and title fields are separated
by RG, and the title and journal fields by GR (both pairs are punctuation marks).
Modifying Figure 2(b) according to this template, we get Figure 2(c). Then, by checking
the original citation, we can parse out all the metadata correctly.

Yieh-Ran Huang and Jan-Ming Ho, “Distributed Call Admission Control for a Heterogeneous PCS Network”,
to appear IEEE Trans. On Computers, Vol. 51, no. 11, Nov. 2002.

Fig. 2(a). The original citation we want to parse.

Fig. 2(b). The citation in Fig. 2(a) is transformed into its protein sequence,
QAMQARGGRDRVDNRSDNRMDYD. Each blank in the table represents a token in the
citation.



A A A A A A A A A R G T T T T T T T T T G R L L L L L L L R V D F R S D W R M D Y D

Fig. 2(c). We transform the citation sequence into its protein sequence. The template of the
citation is AAAAAAAAARGTTTTTTTTTGRLLLLLLLRVDFRSDWRMDYD
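Once a template has been selected, reading off the metadata reduces to grouping tokens by template letter. The Python sketch below illustrates this for the citation of Figure 2(a). It assumes that the completed template has exactly one letter per token, as in Figure 2(c); the token list is written out by hand, and the field names and the function parse_with_template are hypothetical. It does not show how BLAST selects the template.

```python
# Template letters that carry metadata; punctuation and key-word letters are skipped.
FIELD_NAMES = {
    "A": "author", "T": "title", "L": "journal",
    "F": "volume", "W": "number", "M": "month", "Y": "year",
}


def parse_with_template(tokens, template):
    """Group tokens into fields according to a one-letter-per-token template."""
    if len(tokens) != len(template):
        raise ValueError("template and token list must have the same length")
    fields = {}
    for token, letter in zip(tokens, template):
        name = FIELD_NAMES.get(letter)
        if name is not None:
            fields.setdefault(name, []).append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}


# Tokens of the citation in Figure 2(a), separated by spaces for this sketch.
TOKENS = (
    "Yieh - Ran Huang and Jan - Ming Ho , \" "
    "Distributed Call Admission Control for a Heterogeneous PCS Network \" , "
    "to appear IEEE Trans . On Computers , Vol . 51 , no . 11 , Nov . 2002 ."
).split()

# 9 author tokens, 9 title tokens and 7 journal tokens, as in Figure 2(c).
TEMPLATE = "A" * 9 + "RG" + "T" * 9 + "GR" + "L" * 7 + "RVDFRSDWRMDYD"

fields = parse_with_template(TOKENS, TEMPLATE)
```

In this sketch the volume, number, month and year fields come out as 51, 11, Nov and 2002, and the title field contains the nine words between the quotation marks.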



3 Experiment Results 



Ideally, all of the metadata in the BibTeX file should be consistent with the metadata in
the citations retrieved from CiteSeer. Unfortunately, only a small fraction are consistent:
we experimented with 2,500 article citations, but only 100 citations were fully consistent
with BibTeX. To overcome this problem, we devise a precision evaluation method to
test whether the data are correctly parsed. We define the precision of each subfield as:

\[ \text{Subfield precision} = \frac{\#\big((Token_{Number} + Token_{Word}) \cap Token_{Subfield}\big)}{\#\big((Token_{Number} + Token_{Word}) \cap Token_{BibTeX}\big)} \qquad (1) \]

Token_Number: denotes the number tokens that appear in the parsed subfield.
Token_Word: denotes the general word tokens that appear in the parsed subfield.
Token_Subfield: denotes the tokens that appear in the specific subfield in the BibTeX file.
Token_BibTeX: denotes all of the tokens that appear in the BibTeX file.

The denominator of the subfield precision represents the number of tokens that
exist both in the citation file and in the BibTeX file. The numerator of the subfield
precision represents the number of tokens that are correctly parsed. We then
define the total precision of a citation as the average of its subfield precisions.

Using the 2,500 templates generated by our template generating system as the template
database and parsing all of the 2,500 citations using our parsing system, we obtain an 89%
precision rate. Using cross-validation to validate the precision of our system, we divide the
citations into 10 subsets and get an average precision rate of 84.8%. The precision
distribution chart of the first 100 citations is shown in Figure 3. The parsing results are
adequate because 85% of the citations achieve a precision rate of 70% or higher.
Although the precision calculated here is not the true precision, the disparity between
the two is small. The actual precision of the parsed result is roughly 80%.

BLAST needs a scoring table to evaluate the alignment result.
In our system, we use the scoring table shown in Figure 4. We also tried many different
scoring tables to parse the citations. The diagonals of all these scoring matrices
were always positive, and the variation in the precision of the parsing results caused
by a particular choice of scoring table was less than 3%.

Both ParaTools and our system are template-based reference parsers. The completeness
of the template database is an important factor for template-based parsing systems. We
illustrate the effect of template completeness on precision in Figure 5. Our system
performs better than ParaTools at every level of template completeness tested. In Table 1,
we present the quality of the parsing results. Our quality is better than that of ParaTools.
Because the parsing results are intended for reuse, results of poor quality have little value.




Fig. 3. Precision rates of the first 100 citations. The X axis is the assigned number of a
citation, and the Y axis is the precision of that citation.

Fig. 4. The scoring table that we use to evaluate the alignment score of the protein
sequences.

Fig. 5. Precision rates achieved with varied template databases.

Table 1. Quality of the parsing results under k-fold cross-validation (the data are divided
into k sets). The quality of a parsing result is defined as good if its precision rate is
better than 70%. The good results also include the perfect results, whose precision rates
are 100%.

Quality of the parsing result (percent)

Value of k   Description        OpCit   Our system
10           Perfect            21      71
10           Good (>70%)        26      85
10           Not good (<70%)    74      15
5            Perfect            21      59
5            Good (>70%)        26      76
5            Not good (<70%)    74      24
2            Perfect            6       37
2            Good (>70%)        6       52
2            Not good (<70%)    94      48



4 Conclusion and Future Works 

A template-based system handles citations flexibly. We can not only add
new citation templates easily, but also search for the most similar template to parse
the metadata rapidly. We have also shown that the precision of the parsing result varies
with the completeness of the template database. ParaTools contains about 400
templates, but they do not fit the 2,500 citations well: the precision rate of
ParaTools on these data is only 30%. We also demonstrate that our system still performs
better when both systems use the same set of templates. We attribute this to the power of
the string comparison tool, BLAST, and to our design of rewriting rules that transform the
citation parsing problem into a protein sequence alignment problem. In the future, we
will generate more templates to match a broader range of citation formats. Our study
also suggests that developing BLAST-based solutions for other template matching
problems is a promising research direction.

Abstract. This paper presents a new model for predicting when an
online customer leaves the current page and which Web page the customer
will visit next. The model can also forecast the total number of visits to a given Web
page by all incoming users at the same time. The prediction technique can be
used as a component of many Web-based applications. The prediction model
regards a Web browsing session as a continuous-time Markov process whose
transition probability matrix can be computed from Web log data using
Kolmogorov's backward equations. The model is tested against real Web-log
data, and the scalability and accuracy of our method are analyzed.

Keywords: Web mining, continuous-time Markov chain, Kolmogorov's
backward equations, sessions, transition probability



1 Introduction 

Web mining is a thriving technology in the practice of Web-based applications. By
applying data mining technologies such as clustering, association rules and discrete
Markov models, Web mining has been successfully applied to Web personalization
and Web-page pre-fetching. Studies have shown that the next page an online customer is
going to visit can be predicted with statistical models built from Web sessions [4, 5].
However, an open problem is when an online customer will click on a predicted next
page. A related question is how many customers will click the same page at the same
time. In this paper, we answer these questions from the perspective of Markov models.

Markov models are one of the major technologies for studying the behavior of Web
users. In the past, discrete Markov models have been widely used to model sequential
processes, and have achieved many practical successes in areas such as Web-log
mining. The transition matrix of the Markov process can be computed from
user-session traces, and the Frobenius norm of the difference of two
transition probability matrices can show the difference between the two corresponding
sequences [4]. Users can be clustered by learning a mixture of first-order Markov
models with an Expectation-Maximization algorithm [7]. Relational Markov
models (RMMs) make effective learning possible in domains with very large and
heterogeneous state spaces and only sparse data [11].

In this paper, we present a continuous-time Markov model for the prediction
task. In contrast with previous work, which uses prediction rules [5], we use
Kolmogorov's backward equations to compute the transition probability from one
Web page A to another Web page B in three steps: first, we
preprocess the Web-log data to build user sessions; second, we compute the transition
rate of a user leaving a Web page A and obtain a transition rate matrix; third, we
compute the transition probability matrix using Kolmogorov's backward
equations. From the transition probability matrix we can predict which Web page a user
will visit next, and when the user will visit it. Furthermore, we can find the total
transition count from all other Web pages to the predicted Web page at the same time.

The main contribution of this work is to put forward two hypotheses for
computing a transition probability model from one Web page to another. The first
hypothesis treats Web browsing sessions as a continuous-time Markov process. The
second hypothesis regards the time until a user leaves a Web page as
exponentially distributed. With the second hypothesis we can compute the transition
rate for leaving the current Web page; then, under the first
hypothesis, we compute the transition probability matrix by Kolmogorov's
backward equations.

The paper is organized as follows. Section 2 presents the prediction model. 
Section 3 presents the experiment results. Section 4 concludes the paper. 



2 Continuous Time Markov Chains for Prediction Models 

A Web log often contains millions of records, where each record refers to a visit by a
user to a certain Web page served by a Web server. A session is an ordered set of Web
pages visited in one visit by the same visitor within a given time period.

A sequence of pages in a user session can be modeled by a Markov chain with a
finite number of states [5, 9]. In discrete Markov chains, we need to consider the
minimal time interval between page transitions, which is not easy to predict. In this
paper, we propose to use continuous-time Markov chains, instead of discrete Markov
chains, for predicting the next Web page to be visited, when it will be visited, and the
total transition count from all other Web pages to the predicted Web page. As a result,
we can handle varying time intervals between Web page visits.



2.1 Continuous Time Markov Chains 

A stochastic process is called a continuou'time Markov chain at state i 

• The amount of time it spends in state i, before making a transition to a different 
state, is exponentially distributed with rate v,, and 

• It enters the next state j from state i with probability P^j, where P.. = 0 and . P.. = 
1 for every possible state j in the state space. 

Let $\{X(t), t \ge 0\}$ be a continuous-time Markov chain of a user browsing Web 
pages. Here the state of a continuous-time Markov chain refers to a Web page and the 
state space contains all possible Web pages, which is finite but large. The main 
characteristic of a continuous-time Markov chain is that the conditional distribution 
of the next Web page to be visited by a user at time t+s depends only on the 
present Web page at time s and is independent of the previous Web pages. For 
simplicity, we let 

$$P_{ij}(t) = P\{X(t+s) = j \mid X(s) = i\}.$$ 


In a small time interval h, 

• $1 - P_{ii}(h)$ is the probability that the process in state i at time 0 will not be in 
state i at time h. $1 - P_{ii}(h) = v_i h + o(h)$ is equal to the probability that a transition 
occurs in (0, h). 

• $P_{ij}(h) = h v_i P_{ij} + o(h)$ is the probability that a transition occurs in (0, h) from 
state i to state j. 

The limits of $(1 - P_{ii}(h))/h$ and $P_{ij}(h)/h$ are equal to $v_i$ and $v_i P_{ij}$ respectively 
as h approaches zero. 

In order to predict the time at which a user will visit the next Web page, we 
make use of the well-known Kolmogorov backward equations [1] for the 
continuous-time Markov chain: 

$$P'_{ij}(t) = v_i \sum_{k \ne i} P_{ik} P_{kj}(t) - v_i P_{ij}(t)$$ 


The transition rate matrix $R = \{r_{ij}\}$ is defined by its entries as 

$$r_{ij} = \begin{cases} v_i P_{ij} & \text{if } i \ne j \\ -v_i & \text{if } i = j \end{cases}$$ 

The Kolmogorov backward equations can be rewritten as 

$$P'_{ij}(t) = \sum_k r_{ik} P_{kj}(t).$$ 

We can write it in the matrix form as 

$$P'(t) = R P(t).$$ 

This is a system of ordinary differential equations and the general solution is 
given by 

$$P(t) = e^{Rt}.$$ 

The above formula gives the probability of the next Web pages to be visited at 
time t. We need to compute the matrix R in order to use the continuous-time Markov 
chain for the prediction. 
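
As a small worked check of the relations above, the following sketch builds the rate matrix R from leaving rates and jump probabilities and verifies numerically that P'(t) = R P(t) for a two-page chain, where the matrix exponential has a well-known closed form. The function names and the rates 0.02 and 0.05 are illustrative assumptions, not values from the paper.

```python
import math

def rate_matrix(v, P):
    """Build R with r_ij = v_i * P_ij for i != j and r_ii = -v_i."""
    n = len(v)
    return [[-v[i] if i == j else v[i] * P[i][j] for j in range(n)]
            for i in range(n)]

def two_state_P(a, b, t):
    """Closed-form e^{Rt} for the two-state rate matrix R = [[-a, a], [b, -b]]."""
    s = a + b
    e = math.exp(-s * t)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]

def mat_mul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical two-page chain: page 0 is left at rate 0.02/s, page 1 at 0.05/s.
a, b = 0.02, 0.05
R = rate_matrix([a, b], [[0.0, 1.0], [1.0, 0.0]])

# Compare a central-difference estimate of P'(t) with R P(t) at t = 10 s.
t, h = 10.0, 1e-4
P_plus, P_minus = two_state_P(a, b, t + h), two_state_P(a, b, t - h)
derivative = [[(P_plus[i][j] - P_minus[i][j]) / (2 * h) for j in range(2)]
              for i in range(2)]
rhs = mat_mul(R, two_state_P(a, b, t))
```

Each row of R sums to zero and each row of P(t) sums to one, which is a quick sanity check worth running on any estimated rate matrix.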



2.2 Computation of the Transition Rate Matrix R 

According to the definition of $P_{ij}$ and $v_i$ in the previous subsection, the term $v_i$ is the 
rate at which the Markov process leaves Web page i and $P_{ij}$ is the probability that the 
Markov process enters Web page j. 

Let us use the popular NASA Web-log data set to demonstrate how to compute $v_i$ 
and $P_{ij}$. To illustrate, we extract a URL whose index is 2327. Firstly, we compute the 
time $t_{ij}$ (j = 1, 2, ..., n) for all people who visited Web page i and entered another 
page j after the visit, where n is the number of visits from Web page i to a next Web 
page. Then we sort all $t_{ij}$. Next we compute the accumulative frequency 
$N_{im}$ of visits of Web page i up to time $t_{im}$. We note that $N_i$ is the accumulative 
frequency of visits of Web page i at the time the last visitor leaves. Therefore, we can 
calculate the probability of leaving a Web page i as $N_{im}/N_i$ at time $t_{im}$. 



Fig. 1. The probability distribution of leaving a Web page with time (URL_id = 2327). 

Fig. 2. The deviation of the actual count of visits to page 2334 from the predicted count. 



Figure 1 shows that the probability of leaving a Web page is exponentially 
distributed. By considering the statistical hypothesis in the continuous-time Markov 
chain, we model the curve in Figure 1 by an exponential distribution function as 
follows: 

$$F(t) = 1 - e^{-\lambda t},\ t \ge 0 \quad \text{and} \quad F(t) = 0,\ t < 0$$ 

where $\lambda$ is equivalent to the transition rate $v_i$ of the continuous-time Markov chain 
[1]. $\lambda$ can be determined as follows: 

$$\lambda = -\ln(1 - F(t))/t.$$ 

Based on this formula, we estimate the transition rate $v_i$ as follows: 

$$v_{im} = -\ln(1 - N_{im}/N_i)/t_{im}, \quad m = 1, 2, \ldots, n-1$$ 
$$v_i = (v_{i1} + v_{i2} + \cdots + v_{i,n-1})/(n-1)$$ 

For each Web page, we employ the same procedure to obtain the estimates of the 
transition rate $v_i$. The probability $P_{ij}$ of leaving Web page i for Web page j can 
be estimated from the data by counting the relative frequency of the next visit to Web 
page j from Web page i. Finally, we obtain the transition rate matrix R from $P_{ij}$ and $v_i$ 
by the following formula: 

$$r_{ij} = \begin{cases} v_i P_{ij} & \text{if } i \ne j \\ -v_i & \text{if } i = j \end{cases}$$ 
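
The estimation procedure of this subsection can be sketched as follows. This is a minimal illustration, assuming the leaving times for one page are available as a list of seconds and that the empirical distribution is evaluated at each distinct observed time; the function names are hypothetical, and the last observation is skipped because its empirical probability is 1 and the logarithm is undefined there.

```python
import math
from bisect import bisect_right
from collections import Counter

def estimate_leave_rate(leave_times):
    """Average of v_im = -ln(1 - N_im / N_i) / t_im over the usable points."""
    times = sorted(leave_times)
    n = len(times)
    rates = []
    for t_m in sorted(set(times)):
        frac = bisect_right(times, t_m) / n   # N_im / N_i at time t_im
        if t_m <= 0 or frac >= 1.0:           # skip t = 0 and the final point
            continue
        rates.append(-math.log(1.0 - frac) / t_m)
    if not rates:
        raise ValueError("need at least two distinct positive leaving times")
    return sum(rates) / len(rates)

def transition_probs(next_pages):
    """Relative frequency of each next page j observed after page i."""
    counts = Counter(next_pages)
    total = sum(counts.values())
    return {page: c / total for page, c in counts.items()}

# Synthetic leaving times placed at the exact quantiles of an exponential
# distribution with rate 0.01 per second (illustrative data, not NASA logs).
lam, n = 0.01, 50
sample = [-math.log(1 - m / n) / lam for m in range(1, n)]
sample.append(sample[-1] * 2)                 # the last visitor to leave
v_i = estimate_leave_rate(sample)
```

On real logs the individual estimates $v_{im}$ scatter around the fitted rate, so averaging them is a simple smoothing choice; a least-squares fit of $\ln(1 - F(t))$ against t would be an alternative.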



3 Empirical Analyses 

3.1 Experimental Setup 

The experiment was conducted on the NASA data set, which contains one month's 
worth of all HTTP requests to the NASA Kennedy Space Center WWW server in 
Florida. The log was collected from 00:00:00 August 1, 1995 through 23:59:59 
August 31, 1995. We filtered out requests that were not needed in this 
experiment: image or CSS requests that were issued automatically after a request 
for a document page containing links to these files, and some half-baked 
(incomplete) requests [5]. 

We consider the Web log data as a sequence of distinct Web pages, where 
subsequences, such as user sessions, can be identified by unusually long gaps between 
consecutive requests. To save memory space, we use numeric IDs to identify Web 
pages and users. To simplify comparisons, timestamps were converted 
to seconds elapsed since 1970. 

In deciding on the boundary of the sessions, we apply two rules. First, if the 
time interval between two consecutive visits in a user session is larger than 
1800 seconds, we consider the next visit to start a new session. Second, if a user has 
two consecutive records visiting the same Web page within 1800 seconds, we consider 
the next visit to be a new session. We loaded all sessions into a session data cube and 
operated on the cube to extract the set of sessions for building the continuous-time 
Markov chain as described in Section 2. 
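
The two session-boundary rules can be sketched as below. This is only an illustration under stated assumptions: each user's records are a time-sorted list of (timestamp in seconds, page ID) pairs, and `split_sessions` is a hypothetical helper name, not part of the authors' data cube.

```python
def split_sessions(records, gap=1800):
    """Split one user's time-sorted (timestamp, page_id) records into sessions.

    A new session starts when the gap to the previous record exceeds `gap`
    seconds, or when the same page is requested twice in a row.
    """
    sessions = []
    current = []
    for ts, page in records:
        if current:
            prev_ts, prev_page = current[-1]
            if ts - prev_ts > gap or page == prev_page:
                sessions.append(current)
                current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions

# Illustrative records for one user (timestamps in seconds, made-up page IDs).
records = [(0, 1), (100, 2), (2000, 3), (2100, 3), (2200, 4)]
sessions = split_sessions(records)
```

Here the 1900-second gap before the third record and the repeated request for page 3 each open a new session, giving three sessions in total.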



3.2 Computing the Transition Probability from One Web Page to Others 

After the continuous-time Markov chain for visiting Web pages was built, we used the 
formula $P(t) = e^{Rt}$ to calculate the probability of entering another Web page from the 
current page. The matrix $P(t)$ is estimated by: 

$$P(t) = e^{Rt} = \lim_{n \to \infty} (I + Rt/n)^n$$ 

To obtain the limiting effect, we raised the matrix $I + Rt/n$ to the nth 
power for sufficiently large n. 
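
A minimal sketch of this limit formula follows. It assumes n is chosen as a power of two, n = 2^k, so that the nth power can be obtained with k matrix squarings; that choice and the function names are assumptions made for illustration, since the paper does not say how the power was computed.

```python
import math

def mat_mul(A, B):
    """Plain dense matrix product for small square matrices."""
    size = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(size)) for j in range(size)]
            for i in range(size)]

def approx_transition_matrix(R, t, k=20):
    """Approximate e^{Rt} by (I + Rt/n)^n with n = 2**k, using k squarings."""
    size = len(R)
    n = 2 ** k
    M = [[(1.0 if i == j else 0.0) + R[i][j] * t / n for j in range(size)]
         for i in range(size)]
    for _ in range(k):
        M = mat_mul(M, M)
    return M

# Two-page chain with made-up leaving rates 0.02/s and 0.05/s.
R = [[-0.02, 0.02],
     [0.05, -0.05]]
P10 = approx_transition_matrix(R, 10.0)

# Closed-form P_00(t) for a two-state chain, used only to check the result.
exact_p00 = (0.05 + 0.02 * math.exp(-0.07 * 10.0)) / 0.07
```

With a small n the factor I + Rt/n can contain negative diagonal entries when t is large, which is one way the approximation becomes unreliable; checking that every row of the result sums to one and lies in [0, 1] helps catch this.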



3.3 Predicting Which Next Pages People Will Visit, and When 

We ran experiments to test the validity of the prediction model. Tracing through the 
NASA data set, a person visiting the Web page 2359 will decide where to go next. 
We make a prediction on when and where this visitor will go next, using our model. 

Let us set the time at which the person was visiting Web page 2359 as the zeroth 
second. We count the transitions from page 2359 to other pages before the 
start time of prediction. A total of 19 pages were visited by visitors who left 
page 2359 before the zeroth second; these pages are listed in Table 1, where $N_{ij}$ is the 
transition count from Web page i = 2359 to Web page j that actually happened in the 
data set before the zeroth second. 

Table 1. The transition parameters from Web page 2359 to other Web pages in 600 seconds. 

URL_id   N_ij   P_ij     N'_ij 
2324     1      0.0093   1 
2442     1      0.0013   1 
2383     2      0.0493   2 
2364     1      0.0111   1 
2334     32     0.0367   33 
2362     1      0.0343   1 
2969     5      0.0049   5 
2333     7      0.0193   7 
2358     1      0.0650   1 
2496     2      0.0714   2 
2375     2      0.0237   2 
2344     1      0.0130   1 
2325     3      0.0234   3 
2372     1      0.0321   1 
2657     2      0.0018   2 
2374     1      0.0168   1 
2393     1      0.0011   1 
2370     1      0.0026   1 
2327            0.0259 


Fig. 3. The $P_{ij} \times N_i$ value at 600 seconds. 

With our model, we can compute the transition probability values from Web 
page 2359 to all other Web pages at time t. When the time t is 600 seconds, Figure 3 
shows the value of $P_{ij} \times N_i$, where the integer part is the predicted transition count 
from Web page i to Web page j during a given time interval. The value of $\mathrm{int}(P_{ij} \times N_i)$ 
at Web page 2334 increases up to one, so we can predict that the next page to be visited 
from Web page 2359 is Web page 2334 and that the transition occurs within 600 seconds. 

We use $N'_{ij}$ to denote the total transition count from Web page 2359 to another 
Web page j that actually happened in the NASA data set before the end of the prediction 
time. In Table 1, $N'_{ij} - N_{ij}$ is one at Web page 2334. In reality, the person did leave 
Web page 2359 for Web page 2334 within 600 seconds. 



3.4 Predicting the Visiting Count of a Web Page 

As the prediction time increases from zero seconds, we can predict the next 
Web page visited by an online person as the page for which the $\mathrm{int}(P_{ij} \times N_i)$ value 
first becomes one. We denote the corresponding time of visiting the next Web page as 
$t_j$ seconds. $N_i$ is the total transition count from Web page i to other Web pages before 
the start of prediction. $\mathrm{int}(P_{ij} \times N_i)$ is the predicted count of visitors leaving Web 
page i and entering Web page j during the prediction time. Thus, we can compute the 
total transition count from all Web pages to the destination Web page in the interval 
from zero to $t_j$ seconds. 



Table 2. The transition parameters from all Web pages to Web page 2334 in 600 seconds. 

URL_id   P_ij    N_i   int(P_ij*N_i)   N_i'   N_i'-N_i 
2324     0.036   214   7               220    6 
2539     0.045   4     0               4      0 
2352     0.035   74    2               75     1 
2327     0.038   187   7               190    3 
2442     0.011   25    0               26     1 
2356     0.038   9     0               9      0 
2383     0.037   108   3               111    3 
2388     0.043   39    1               40     1 
2440     0.024   2     0               2      0 
2387     0.047   19    0               21     2 
2362     0.037   60    2               62     2 
2333     0.033   38    1               38     0 
2358     0.029   26    0               27     1 
2496     0.037   15    0               15     0 
2375     0.04    101   4               103    2 
2322     0.044   46    2               47     1 
2379     0.041   38    1               39     1 
2349     0.03    67    2               70     3 
2441     0.039   14    0               15     1 
2438     0.043   13    0               13     0 
2669     0.028   6     0               6      0 
2325     0.035   150   5               156    6 
2372     0.021   22    0               22     0 
2359     0.037   65    2               66     1 
2357     0.047   9     0               9      0 
2374     0.046   38    1               38     0 
2473     0.031   16    0               16     0 
2329     0.042   25    1               25     0 
2423     0.035   25    0               26     1 
3131     0.044   7     0               7      0 
2990     0.035   7     0               7      0 
2338     0.015   10    0               10     0 
2400     0.033   28    0               29     1 
2389     0.052   3     0               3      0 
2506     0.042   3     0               3      0 
2517     0.033   7     0               8      1 
2505     0.035   14    0               16     2 
2376     0.040   24    1               25     1 
2419     0.050   14    0               14     0 
2682     0.046   12    0               13     1 
2336     0.049   26    1               26     0 
2405     0.044   20    0               20     0 
2439     0.020   6     0               6      0 
2854     0.023   5     0               5      0 
2447     0.018   9     0               9      0 
2600     0.038   3     0               3      0 
2953     0.022   5     0               5      0 
2411     0.049   4     0               4      0 
2462     0.029   9     0               9      0 
2504     0.026   4     0               4      0 
2561     0.033   2     0               2      0 
2918     0.081   2     0               2      0 
2558     0.024   1     0               3      2 
2578     0.032   4     0               5      1 
2468     0.029   16    0               16     0 
2672     0.061   2     0               2      0 
2767     0.029   5     0               5      0 
2637     0.026   3     0               3      0 
2699     0.022   4     0               4      0 
3129     0.040   2     0               2      0 


Table 2 shows the results for all 60 previous pages from which visitors entered 
page 2334 within 600 seconds. The $P_{ij}$ entries list the transition probability values 
from another page to page 2334 in 600 seconds. $N_i$ (or $N'_i$) is the total visiting 
count from the previous Web page (URL_id) to Web page 2334 before the start (or 
the end) of the prediction time. $N'_i - N_i$ is the actual visiting count from the Web page 
(URL_id) to Web page 2334 during the prediction time. The total count of visits to 
Web page 2334 in 600 seconds was calculated as 45 using $\sum \mathrm{abs}(P_{ij} \times N_i)$, and the 
actual visiting count was 42 according to $\sum \mathrm{abs}(N'_i - N_i)$. 
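
As a small worked example of the predicted count, the sketch below recomputes $\mathrm{int}(P_{ij} \times N_i)$ for the first ten pages listed in Table 2, using the $P_{ij}$ and $N_i$ values printed there. The variable names are illustrative; because the printed probabilities are rounded to three decimals, recomputing other entries of the table this way is not guaranteed to reproduce the printed integer.

```python
# (URL_id, P_ij at 600 seconds, N_i) for the first ten pages listed in Table 2.
rows = [
    (2324, 0.036, 214), (2539, 0.045, 4), (2352, 0.035, 74),
    (2327, 0.038, 187), (2442, 0.011, 25), (2356, 0.038, 9),
    (2383, 0.037, 108), (2388, 0.043, 39), (2440, 0.024, 2),
    (2387, 0.047, 19),
]

# The predicted transition count is the integer part of P_ij * N_i.
predicted = {url: int(p * n) for url, p, n in rows}

# Pages from which at least one visitor is predicted to enter page 2334.
likely_sources = sorted(url for url, count in predicted.items() if count >= 1)
```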

Figure 2 shows the prediction deviation value $\mathrm{int}(P_{ij} \times N_i) - (N'_i - N_i)$, which 
mostly varies within the range [-1, 1]. The error rate of the total predicted visiting count 
to a Web page is computed as follows. 
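
$$ER_j = \Big( \sum_{i=1}^{m} ER_{ij} \Big) / m$$ 

where 

$$ER_{ij} = \begin{cases} 
0 & \text{if } \mathrm{int}(P_{ij} \times N_i) = (N'_i - N_i) \\ 
1 & \text{if } \mathrm{int}(P_{ij} \times N_i) \ne 0 \text{ and } (N'_i - N_i) = 0 \\ 
\dfrac{\mathrm{abs}(\mathrm{int}(P_{ij} \times N_i) - (N'_i - N_i))}{N'_i - N_i} & \text{if } \mathrm{int}(P_{ij} \times N_i) \ne (N'_i - N_i) \text{ and } \mathrm{int}(P_{ij} \times N_i) \ge 0 \\ 
 & \text{and } 0 < \mathrm{abs}(\mathrm{int}(P_{ij} \times N_i) - (N'_i - N_i)) \le (N'_i - N_i) \\ 
1 & \text{if } \mathrm{int}(P_{ij} \times N_i) \ne (N'_i - N_i) \text{ and } \mathrm{int}(P_{ij} \times N_i) \ne 0 \\ 
 & \text{and } \mathrm{abs}(\mathrm{int}(P_{ij} \times N_i) - (N'_i - N_i)) > (N'_i - N_i) 
\end{cases}$$ 
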
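
The error-rate definition can be sketched in a few lines. This is one reading of the piecewise rule, assuming an exact match scores 0, a relative deviation no larger than the actual count scores that fraction, and every other mismatch scores 1. The two lists are the $\mathrm{int}(P_{ij} \times N_i)$ and $N'_i - N_i$ entries of Table 2 in the order printed; `pair_error` is a hypothetical helper name.

```python
def pair_error(pred, actual):
    """Per-page error ER_ij for a predicted and an actual visit count."""
    if pred == actual:
        return 0.0
    if actual == 0:                 # a visit was predicted but none happened
        return 1.0
    deviation = abs(pred - actual)
    return deviation / actual if deviation <= actual else 1.0

# int(P_ij * N_i) for the 60 pages of Table 2, in table order.
pred = [7, 0, 2, 7, 0, 0, 3, 1, 0, 0,
        2, 1, 0, 0, 4, 2, 1, 2, 0, 0,
        0, 5, 0, 2, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# N_i' - N_i for the same pages.
actual = [6, 0, 1, 3, 1, 0, 3, 1, 0, 2,
          2, 0, 1, 0, 2, 1, 1, 3, 1, 0,
          0, 6, 0, 1, 0, 0, 0, 0, 1, 0,
          0, 0, 1, 0, 0, 1, 2, 1, 0, 1,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 2, 1, 0, 0, 0, 0, 0, 0]

error_rate = sum(pair_error(p, a) for p, a in zip(pred, actual)) / len(pred)
```

Under this reading the 60 pages give an error rate of about 0.344, which agrees with the 34.4% the text reports for Table 2.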



The error rate of the predicted count of visits to Web page 2334 in Table 2 was 
calculated as 34.4%. The main reason for the error is that some people do not 
visit Web pages according to the patterns in the training data set. In addition, in the 
prediction model, the computation of the transition probability $P_{ij}$ by the following 
formula may be unstable when the parameter n is not large enough, which causes an 
error in the prediction. 

$$P(t) = e^{Rt} = \lim_{n \to \infty} (I + Rt/n)^n$$ 




Fig. 4. The relation of the run time and the predicting time (x-axis: predicting time in 
seconds). 

Fig. 5. The run time versus the number of visited pages. 



3.5 Scalability Experimental Results 

We conducted experiments to test the scalability of the continuous-time Markov 
chain prediction method. The experiments were carried out on a 662 MHz Pentium III 
with 196 MB of RAM. The NASA data covering 39608 seconds from 00:00:00 August 1, 
1995 was used. We computed the transition probability for different predicting time 
lengths (T). Figure 4 shows the linear relation between the predicting time length and 
the computing time. For a given precision value, the relation between the predicting 
time length (T) and the running time is also linear. 

Using training data sets of different sizes, which included different numbers of 
visited Web pages, we computed the transition probability within the next ten 
seconds. Figure 5 shows the relation between the number of visited Web pages and the 
running time. 



4 Conclusions 

In this paper, we explored using the continuous-time Markov chain to predict not only 
the next Web page a person will visit and the time when it will happen, but also the 
visiting count of that Web page within the same period. The transition probability, 
computed from the continuous-time Markov chain, gives us rich information from which 
we can compute the transition count from one Web page to another, as well as the total 
number of visits to a Web page by all people within a certain period of time. The 
prediction model is validated by an experiment. In the future, we plan to continue to 
explore the application of continuous-time Markov models in Web page prediction. 
We will also consider how to apply the prediction result to prefetching applications. 



Abstract. Recently, several researchers have investigated the discovery of 
frequent XML query patterns using frequent structure mining techniques. All 
these works ignore the order properties of XML queries, and are therefore 
limited in their effectiveness. In this paper, we consider the discovery of 
ordered query patterns. We propose an algorithm for ordered query pattern 
mining. Experiments show that our method is efficient. 



1 Introduction 

Recently, several researchers have investigated the discovery of frequent XML 
query patterns using frequent structure mining (FSM) techniques [YLH03a, YLH03b, 
CW03]. Given a set of XML queries, they model them as unordered trees with special 
XML query constructs like descendant edges or wildcards, and FSM techniques are 
extended to extract frequent subtrees based on the semantics of XML queries. 

In contrast to conventional semi-structured data, elements in XML documents are 
ordered, and queries about the order of elements are supported in most popular XML 
query languages. The existing works on the discovery of query patterns ignore the 
order properties of XML queries, and are therefore limited in their effectiveness. In 
this paper, we propose an algorithm for ordered query pattern mining. Our algorithm 
is efficient because, to count the supports of pattern trees, it need only match 
single-branch pattern trees against the QPTs in the transaction database; the supports 
of multi-branch ones can be computed by reusing intermediate results. 



2 Preliminaries 



A query pattern tree is a labeled ordered tree $QPT = \langle V, E, < \rangle$, where V is the vertex 
set, E is the edge set, and < is the order relationship between sibling nodes. Each 
vertex v has a label, denoted by v.label, whose value is in $\{\text{“*”}\} \cup tagSet$, where 
tagSet is the set of all element names in the context. The root of the query pattern tree 
is denoted by root(QPT). A distinguished subset of edges representing 
ancestor-descendant relationships is called descendant edges. 

Figure 1 (a)-(d) shows four QPTs: $QPT_1$, $QPT_2$, $QPT_3$ and $QPT_4$. Descendant edges 
are shown with dotted lines in diagrams, and tree nodes are shown with circles, where 
the symbols inside the circle indicate their labels. The tuple (n:i) outside of the circle 
indicates the pre-order number and identifier of the corresponding node respectively. 
Given any tree node v of a QPT, its node number, denoted by v.number, is assigned 
according to its position in a pre-order traversal of the QPT, and the meaning of its 
identifier will be explained in the following sections. 



In what follows, given a query pattern tree $QPT = \langle V, E, < \rangle$, we sometimes also 
refer to V and E as QPT if it is clear from the context. Given an edge $e = (v_1, v_2) \in$ 
QPT where $v_2$ is a child of $v_1$, $v_2$ will sometimes be denoted as a d-child of $v_1$ if e is a 
descendant edge, and as a c-child otherwise. For simplicity of presentation, in 
this paper we do not consider duplicate siblings; however, our result is applicable to 
more general cases. 

Given a query pattern tree QPT, a rooted subtree RST of QPT is a subtree of QPT 
such that root(RST)=root(QPT) holds. Let QPT be a pattern tree, the size of QPT is 
defined by the number of its nodes |QPT|. An RST of size k+1 will be denoted as a k- 
edge RST sometimes. An RST will also be denoted as a single branch RST if it has 
only one leaf node, and as a multi-branch RST otherwise. 

To discover frequent query patterns, one important issue is how to test the 
occurrence of a pattern tree in the transaction database. In this paper, we use the 
concept of tree subsumption [MD02] for the occurrence test. 

Given a transaction database $D = \{QPT_i \mid i = 1, \ldots, n\}$, we say RST occurs in D if 
RST is subsumed in a query pattern tree $QPT_i \in D$. The frequency of RST, denoted as 
Freq(RST), is the total occurrence of RST in D, and supp(RST) = Freq(RST)/|D| is its 
support rate. Given a transaction database D and a positive number $0 < \sigma < 1$ called the 
minimum support, mining the frequent query patterns of D means to discover the set 
of RSTs of D, $F_D = \{RST_1, \ldots, RST_m\}$, such that for each $RST \in F_D$, $supp(RST) \ge \sigma$. 



3 Discovering Frequent Query Pattern Trees 

Given a transaction database $D = \{QPT_i \mid i = 1, \ldots, n\}$, we construct its global query 
pattern tree G-QPT as follows. At first, a root is created; its label value will always be 
the name of the root element of the related DTD. Next, each QPT in the 
transaction database is merged with the G-QPT as follows: the root of the QPT is 
always merged with the root of the G-QPT; and for each other node u of the QPT that 
is a c-child (or d-child respectively) of its parent, if its parent is merged with a tree 
node p of the G-QPT, and there exists a c-child (or d-child respectively) q of p such 
that q.label = u.label holds, then u is merged with q; otherwise, a new node q is 
inserted as a c-child (or d-child respectively) of p, and q.label is set to u.label. After 
all the QPTs have been processed, we assign each node p of the G-QPT a number, 
denoted as its identifier or p.id, through a pre-order traversal. Figure 2 shows an 
example of G-QPT obtained from the QPTs in Figure 1 (a), (b), (c) and (d). 

Fig. 1. Query Pattern Trees. 

Fig. 2. G-QPT. 

Discovering Ordered Tree Patterns from XML Queries 

Because each node of a QPT ∈ D is merged with a unique node of the G-QPT, each 
node of QPT has the same identifier as the corresponding node in the G-QPT (see 
Figure 1). After labeling each node with an identifier, the representation of QPTs can 
be simplified to string format. For example, QPT^ can be simplified as “1, 3, 4, -1, -1, 
2, -1”. We will call such encoding strings as string encodings of QPTs. 
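
The merging, identifier assignment and string encoding described above can be sketched as follows. The `Node` class and the two small QPTs are hypothetical (they are not the trees of Figure 1), and the encoding is assumed to list identifiers in pre-order with -1 marking each step back to the parent, which is consistent with the example string in the text.

```python
class Node:
    def __init__(self, label, children=(), edge="c"):
        self.label = label
        self.edge = edge              # "c": child edge, "d": descendant edge
        self.children = list(children)
        self.gnode = None             # G-QPT node this QPT node is merged with
        self.id = None                # pre-order identifier (G-QPT nodes only)

def merge(g_node, q_node):
    """Merge the QPT subtree rooted at q_node into the G-QPT at g_node."""
    q_node.gnode = g_node
    for u in q_node.children:
        q = next((c for c in g_node.children
                  if c.label == u.label and c.edge == u.edge), None)
        if q is None:                 # no matching c-child/d-child: insert one
            q = Node(u.label, edge=u.edge)
            g_node.children.append(q)
        merge(q, u)

def assign_ids(node, next_id=1):
    """Number the G-QPT nodes through a pre-order traversal."""
    node.id = next_id
    next_id += 1
    for child in node.children:
        next_id = assign_ids(child, next_id)
    return next_id

def encode(q_node):
    """String encoding of a QPT as a list of identifiers and -1 markers."""
    out = [q_node.gnode.id]
    for child in q_node.children:
        out.extend(encode(child))
        out.append(-1)
    return out

# Two made-up QPTs that share the root label "a" but order siblings differently.
qpt1 = Node("a", [Node("b"), Node("c")])
qpt2 = Node("a", [Node("c", [Node("d")]), Node("b")])
gqpt = Node("a")
for qpt in (qpt1, qpt2):
    merge(gqpt, qpt)
assign_ids(gqpt)
encoding1, encoding2 = encode(qpt1), encode(qpt2)
```

In this toy case the second tree encodes as 1, 3, 4, -1, -1, 2, -1: its siblings appear in a different order from the G-QPT, which keeps only one sibling order.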

As in [YLH03b], we only try to discover pattern trees that are rooted subtrees of 
some QPTs in the transaction database. In [YLH03b], a schema-guided rightmost 
expansion method is proposed to enumerate all unordered rooted subtrees of the G- 
QPT. In our settings, the schema-guided rightmost expansion method can’t be directly 
used because the G-QPT doesn’t preserve the order relationship between sibling 
nodes. For example, “1, 3, -1, 2, -1” is rooted subtree of QPT^, but it’s not a rooted 
subtree of the G-QPT, and we can’t generate it with the method used in [YLH03b]. 

To handle this issue, we modify the schema-guided rightmost expansion method to 
generate RSTs as follows. Given a k-edge $RST_i^k \in [RST^{k-1}] = \{RST_1^k, RST_2^k, \ldots, RST_N^k\}$, 
sorted in ascending order of their string encodings, let $rmlne(RST_i^k) = \{RST^{k+1} \mid RST^{k+1}$ 
is an RMLNE of $RST_i^k\}$ and $JR(RST_i^k) = \{RST^{k+1} \mid RST^{k+1} = RST_i^k \bowtie RST_j^k,\ j = i+1, \ldots, N\}$ 
$\cup\ \{RST^{k+1} \mid RST^{k+1} = RST_i^k \bowtie RST_j^k$, where $j = 1, \ldots, i-1$, $diff(s_i, s_j) = 1$, and $s_i$ and 
$s_j$ are the string encodings of $RST_i^k$ and $RST_j^k$ respectively$\}$; then 
$[RST_i^k] = rmlne(RST_i^k) \cup JR(RST_i^k)$ holds. The definitions of the equivalence class 
$[RST^{k-1}]$ and the string comparison function diff() can be found in [YLH03b]. 

The main idea of QPMiner is similar to [CW03]. It uses the schema-guided 
rightmost enumeration method to enumerate candidate RSTs level-wise, counts the 
frequency of each candidate RST, and prunes infrequent RSTs based on the 
anti-monotone property of tree subsumption. However, QPMiner uses a different 
method to count the frequency of candidate RSTs. 

Given a rooted subtree RST and a query pattern tree QPT, where the list $L = \langle v_1, 
\ldots, v_m \rangle$ is the set of nodes of RST sorted in pre-order, assume that $RST \subseteq QPT$ holds 
and sim is the simulation relation between their nodes; then the Proper Combinations 
of sim is the set $PC = \{\langle v'_1, \ldots, v'_m \rangle \mid v'_i \in QPT,\ i = 1, \ldots, m\}$ such that for each 
$\langle v'_1, \ldots, v'_m \rangle \in PC$: if $v_j$ is a c-child of $v_k$, $v'_j$ must be a c-child of $v'_k$; if $v_j$ is a d-child 
of $v_k$, $v'_j$ must be a proper descendant of $v'_k$, where j, k are any integers such that 
$1 \le j, k \le m$ holds. A proper combination is called a strong proper combination iff the 
following condition holds: given any two sibling nodes $v_j$ and $v_k$ of RST, $v_j < v_k$ iff 
$v'_j.number < v'_k.number$. 

Given a rooted subtree RST and a query pattern tree QPT where the list $L = \langle v_1, \ldots, 
v_l, \ldots, v_m \rangle$ is the set of nodes of RST sorted in pre-order, and $v_l$ and $v_m$ are the second 
rightmost leaf and the rightmost leaf of RST respectively, assume that $RST \subseteq QPT$ 
holds and sim is the simulation relation between their nodes; then the rightmost 
occurrence of RST in QPT is the set $rmo(RST, QPT) = \{\langle v'_l, v'_m \rangle \mid \langle v'_l, v'_m \rangle$ is a 
sub-list of some proper combination $\langle v'_1, \ldots, v'_l, \ldots, v'_m \rangle$ of sim$\}$. The rightmost 
occurrence rmo(RST, QPT) is called a strong rightmost occurrence iff $\langle v'_1, \ldots, v'_l, \ldots, 
v'_m \rangle$ is a strong proper combination. The rightmost occurrence rmo(RST, QPT) will 
also be denoted as srmo(RST, QPT) if it is a strong rightmost occurrence. 

Given an RST, a transaction database D and its global query pattern tree G-QPT, 
the strong rightmost occurrence of RST in D is the set $srmo(RST, D) = \{\langle u, v, 
\{\langle QPT.tid, u.number, v.number \rangle \mid$ for all $QPT \in D$ such that $\langle u, v \rangle \in srmo(RST, 
QPT)\} \rangle \mid \langle u, v \rangle \in rmo(RST, \text{G-QPT})\}$. 

Given a rooted subtree RST and a query pattern tree QPT, where $\langle v_1, \ldots, v_m \rangle$ 
is the set of nodes of RST sorted in pre-order, and $v_j$ is a node of the rightmost branch 
of RST, assume that $RST \subseteq QPT$ holds, sim is the simulation relation between their 
nodes, and $\langle v_j, v' \rangle \in sim$; we define the Conditional Rightmost Occurrence satisfying 
$\langle v_j, v' \rangle \in sim$ as the set $\{v'_m \mid$ there exists a proper combination $\langle v'_1, \ldots, v'_j, \ldots, 
v'_m \rangle$ of sim such that $v'_j = v'\}$. We denote it as $rmo_{\langle v_j, v' \rangle \in sim}(RST, QPT)$. 

Theorem 1: Given a transaction database D, its global query pattern tree G-QPT, 
and two k-edge RSTs $RST_1^k, RST_2^k \in [RST^{k-1}]$, let $RST^{k+1} = RST_1^k \bowtie RST_2^k$, let p be the 
node of $RST^{k+1}$ not present in $RST_1^k$ (i.e., the rightmost leaf of $RST_2^k$), and let the junction 
node q be the parent of p; then we have: 

1. If $RST_1^k$ is a single-branch RST, then $srmo(RST^{k+1}, D) = \{\langle u, u', \{\langle tid, u.number, 
u'.number \rangle \mid \langle tid, v.number, u.number \rangle \in List_1$, $\langle tid, v'.number, u'.number \rangle \in List_2$, 
$u.number < u'.number\} \rangle \mid \langle v, u, List_1 \rangle \in srmo(RST_1^k, D)$, $\langle v', u', List_2 \rangle \in srmo(RST_2^k, 
D)$, where $u \in rmo_{\langle q, q' \rangle \in sim}(RST_1^k, \text{G-QPT})$, $u' \in rmo_{\langle q, q' \rangle \in sim}(RST_2^k, \text{G-QPT})$, and 
$q' \in \text{G-QPT}\}$. 

2. If both $RST_1^k$ and $RST_2^k$ are multi-branch RSTs, and the parent of the rightmost leaf 
of $RST_1^k$ has only one child, then $srmo(RST^{k+1}, D) = \{\langle u, u', \{\langle tid, u.number, 
u'.number \rangle \mid \langle tid, v.number, u.number \rangle \in List_1$, $\langle tid, v'.number, u'.number \rangle \in List_2$, 
$u.number < u'.number\} \rangle \mid \langle v, u, List_1 \rangle \in srmo(RST_1^k, D)$, $\langle v', u', List_2 \rangle \in 
srmo(RST_2^k, D)$, $u \ll v'$, where $u \in rmo_{\langle q, q' \rangle \in sim}(RST_1^k, \text{G-QPT})$, $u' \in 
rmo_{\langle q, q' \rangle \in sim}(RST_2^k, \text{G-QPT})$, and $q' \in \text{G-QPT}\}$. Here $u \ll v'$ means v' is the parent (or 
ancestor respectively) of u if the rightmost leaf of $RST_1^k$ is a c-child (or d-child 
respectively) of its parent. 

3. Otherwise, $srmo(RST^{k+1}, D) = \{\langle u, u', \{\langle tid, u.number, u'.number \rangle \mid \langle tid, 
v.number, u.number \rangle \in List_1$, $\langle tid, v'.number, u'.number \rangle \in List_2$, $u.number < 
u'.number\} \rangle \mid \langle v, u, List_1 \rangle \in srmo(RST_1^k, D)$, $\langle v', u', List_2 \rangle \in srmo(RST_2^k, D)$, $u = v'$, 
where $u \in rmo_{\langle q, q' \rangle \in sim}(RST_1^k, \text{G-QPT})$, $u' \in rmo_{\langle q, q' \rangle \in sim}(RST_2^k, \text{G-QPT})$, and 
$q' \in \text{G-QPT}\}$. 

Theorem 2: For any multi-branch (k+1)-edge rooted subtree $RST^{k+1}$ generated 
through rightmost leaf node expansion of a k-edge rooted subtree $RST_1^k$, there must be 
another k-edge rooted subtree $RST_2^k$, formed by cutting off the second 
rightmost leaf node of $RST^{k+1}$, such that the join of $RST_1^k$ and $RST_2^k$ produces $RST^{k+1}$ 
itself. If $RST_2^k$ exists in $F_k$, then we have: $rmo(RST^{k+1}, D) = \{\langle v, u', \{\langle tid, u.number, 
u'.number \rangle \mid \langle tid, v.number, u.number \rangle \in List_1$, $\langle tid, v'.number, u'.number \rangle \in List_2\} \rangle \mid 
\langle v, u, List_1 \rangle \in rmo(RST_1^k, D)$, $\langle v', u', List_2 \rangle \in rmo(RST_2^k, D)$, $u \ll u'\}$; otherwise, 
$RST^{k+1}$ must be infrequent. 

Based on Theorem 1 and Theorem 2, only those single-branch RSTs need to be 
matched with QPTs. The frequencies of other RSTs can be computed through reusing 
intermediate results. 






4 Conclusion 

We performed three sets of experiments to evaluate the performance of the QPMiner 
algorithm. We implemented an algorithm, baseMiner, as the base algorithm to compare 
QPMiner against. By associating each RST with a TIDList attribute that contains the 
transaction IDs of the QPTs in which the RST occurs, baseMiner need only match 1-edge 
RSTs against each QPT in the transaction database. To count the frequency of a 
k-edge RST $RST^k$ (k > 1), if $RST^k$ is a single-branch RST, baseMiner need only match 
$RST^k$ against QPTs whose transaction IDs are in the set $RST^{k-1}.tidlist$, where $RST^{k-1}$ is 
the (k-1)-edge RST that is a rooted subtree of $RST^k$. If $RST^k$ is a multi-branch k-edge 
RST, then baseMiner need only match $RST^k$ against QPTs whose transaction IDs 
are in the set $RST_1^{k-1}.tidlist \cap RST_2^{k-1}.tidlist$, where $RST_1^{k-1}$ and $RST_2^{k-1}$ are (k-1)-edge 
RSTs that are rooted subtrees of $RST^k$. 
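
The TIDList pruning used by baseMiner can be sketched as below. This is an illustration only: patterns and transactions are reduced to sets of labels, and the `matches` argument stands in for the tree subsumption test, which is not implemented here; all names are hypothetical.

```python
def candidate_tids(parent_tidlists):
    """Intersect the tidlists of a candidate's (k-1)-edge rooted subtrees."""
    lists = [set(t) for t in parent_tidlists]
    result = lists[0]
    for tids in lists[1:]:
        result = result & tids
    return result

def count_occurrences(candidate, parent_tidlists, database, matches):
    """Match the candidate only against transactions that survive pruning.

    Returns the candidate's own tidlist and the number of matching tests run.
    """
    tids = candidate_tids(parent_tidlists)
    tidlist = {tid for tid in tids if matches(candidate, database[tid])}
    return tidlist, len(tids)

# Toy data: each transaction is a set of labels, and "matching" is plain set
# containment (a stand-in for subsumption of a pattern tree by a QPT).
database = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a", "c"}, 4: {"a"}}
tidlist, tests_run = count_occurrences(
    {"a", "b", "c"},
    [{1, 2}, {2, 3}],                 # tidlists of the two smaller patterns
    database,
    lambda pattern, transaction: pattern <= transaction,
)
support = len(tidlist) / len(database)
```

Only one of the four transactions is tested in this toy run, which shows where the pruning saves work; QPMiner goes further by avoiding the matching step for multi-branch candidates altogether.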

The experiments show that QPMiner is 5-7 times faster than baseMiner. The 
reason is that baseMiner matches each candidate RST against QPTs in the 
transaction database, while QPMiner processes the majority of candidate RSTs 
without any matching operation. In addition, the matching time of single-branch RSTs 
is much less than that of multi-branch RSTs. Consequently, QPMiner takes much less 
time than baseMiner. 

Future work includes incremental computation of frequent RSTs. To incorporate 
the result of this paper into a caching system, an important issue is to guarantee the 
freshness of the mining result. If the pattern of user activity changes at a 
relatively high rate, the accuracy of the mining result may deteriorate quickly. Because 
re-computation incurs a high overhead, finding a method to discover frequent RSTs 
incrementally becomes very important. 



Abstract. As the World Wide Web grows rapidly and users' browsing experiences 
need to be personalized, the problem of predicting a user's behavior 
on a web site has become important. We present a probability model that utilizes 
path profiles of users from web logs to predict the user's future requests. Each 
of the user's next probable requests is given a conditional probability value, 
which is calculated according to a function we present. Our model can 
give several predictions ranked by their probability values instead of 
giving only one, thus increasing its recommending ability. The experiments show that 
our algorithm and model perform well. The result can potentially be 
applied to a wide range of applications on the web. 



1 Introduction 

Web mining is the application of data mining technology to huge Web data 
repositories. The purpose of this paper is to explore ways to exploit the information from 
web logs for predicting users' actions on the Web. There has been an increasing amount 
of related work. For example, Syskill & Webert [4] is designed to help users distinguish 
interesting web pages on a particular topic from uninteresting ones. WebTool, an 
integrated system [5], was developed for mining either association rules or sequential 
patterns in web usage data to provide efficient navigation to the visitor. In [6], 
the authors proposed to use Markov chains to dynamically model the URL access 
patterns. 

In this paper, we present a probability model to predict the user's next request. 
From a sequence of previous requests, instead of giving only one prediction, we can 
give several predictions ranked by their probability values. We present a function 
to calculate their conditional probability values and an efficient algorithm 
to implement it. 



2 Construct the WAP-Tree 

There are three main tasks in performing Web Usage Mining or Web Usage Analysis: 
Preprocessing, Pattern Discovery, and Pattern Analysis [2]. 

After preprocessing is applied to the original Web log files, pieces of Web logs can 
be obtained. Let E be a set of events. A Web log piece or (Web) access sequence 
$S = e_1 e_2 \cdots e_n$ ($e_i \in E$ for $1 \le i \le n$) is a sequence of events. 

Our algorithm is based on the compact web access pattern tree (WAP-tree) [1] 
structure. Because our purpose is different, our WAP-tree differs in some respects. In 
order not to lose information, we do not truncate the WAP-tree during construction. 

Each node in a WAP-tree registers two pieces of information: label and count, 
denoted as label: count. The root of the tree is a special virtual node with an empty label 
and count 0. Auxiliary node linkage structures are constructed to assist node traversal 
in a WAP-tree as follows. All of the nodes in the tree with the same label are linked 
by shared-label linkages into a queue, called the event-node queue. The event-node queue 
with label $e_i$ is also called the $e_i$-queue. There is one header table H for a WAP-tree, and 
the head of each event-node queue is registered in H. 

The construction algorithm for the WAP-tree is described in [1]. We can also build
our WAP-tree from frequent sequences, which are mined from the original sequences
with a minimum support. We discuss this in Section 4.
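The structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the construction procedure of [1]: the class names `Node` and `WapTree`, the use of a plain list for each event-node queue, and the sample access sequences are all assumptions made for the example.

```python
from collections import defaultdict


class Node:
    """One WAP-tree node: an event label, a count, and parent/child links."""

    def __init__(self, label, parent=None):
        self.label = label
        self.count = 0
        self.parent = parent
        self.children = {}  # label -> Node


class WapTree:
    """Untruncated WAP-tree with a header table of event-node queues."""

    def __init__(self):
        self.root = Node(label=None)     # virtual root: empty label, count 0
        self.header = defaultdict(list)  # label -> all nodes carrying that label

    def insert(self, sequence):
        """Add one access sequence, sharing prefixes with earlier sequences."""
        node = self.root
        for event in sequence:
            child = node.children.get(event)
            if child is None:
                child = Node(event, parent=node)
                node.children[event] = child
                self.header[event].append(child)  # register in the event-node queue
            child.count += 1
            node = child


tree = WapTree()
for seq in ["abc", "abd", "ac", "bc"]:  # hypothetical access sequences
    tree.insert(seq)
```

Because no node is ever pruned, a new user sequence can be added later by calling `insert` again, without rebuilding the tree.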



3 Predict Future Requests 

Using the previous requests to predict the next can be treated as the problem of finding the
maximum conditional probability. Let $E$ be the set of web pages, and $e_1 e_2 \cdots e_{n-1}$ be
the previous requests. Our target is to find the $e_n$ which satisfies the following equation:

$$p(e_n \mid e_{n-1} e_{n-2} \cdots e_1) = \max\{\, p(e_i \mid e_{n-1} e_{n-2} \cdots e_1) : e_i \in E \,\},$$

where $p(e_i \mid e_{n-1} e_{n-2} \cdots e_1)$ is the conditional probability of a request for
page $e_i$ at the next step.

In practice, it is not easy to find the ideal maximum probability
$p(e_n \mid e_{n-1} e_{n-2} \cdots e_1)$ for a certain user, because many factors affect the
probability and some of them are difficult to obtain and to describe in mathematical form.
Here we use the user’s previous requests in the same session to predict the next request,
without taking personal information into account, which is needed for Syskill & Webert [4].
Compared with the well-known WebWatcher [3], we also do not require the web site link
structure. We only use the logs of the web site. This greatly simplifies the prediction
process without losing much accuracy, and sometimes may even increase it.

The previous requests $e_{n-1}, e_{n-2}, \ldots, e_1$ have different influences on the prediction
of the future request $e_n$. We suppose that the most recent request has the strongest
influence, which is true in most cases. To extend this, we give a coefficient to weigh the
influence of every item in the user’s previous requests. Let $B = b_1 b_2 \cdots b_n$ denote
an n-sequence obtained from the web logs, $f_B$ denote the support of sequence $B$, $\alpha_k$
denote the coefficient (called the weight here) of the $k$-th event in the user’s previous requests,
and $\Omega_i$ denote the collection of all n-sequences in the web logs which satisfy $b_n = e_i$.
We define:

$$\omega(b_k) = \begin{cases} 0, & \exists\, b_j \ne e_j,\ k \le j \le n-1 \\ 1, & b_j = e_j,\ \forall j,\ k \le j \le n-1 \end{cases}$$

We calculate the probability value of $e_i$ appearing at the next step:

$$Q(e_i \mid e_{n-1} e_{n-2} \cdots e_1) = \sum_{B \in \Omega_i} \left[ \frac{f_B}{f_B + M} \sum_{k=1}^{n-1} \alpha_k\, \omega(b_k) \right], \quad M \text{ is a constant} \qquad (1)$$

$$P(e_i \mid e_{n-1} e_{n-2} \cdots e_1) = \frac{Q(e_i \mid e_{n-1} e_{n-2} \cdots e_1)}{\sum_{e_k \in E} Q(e_k \mid e_{n-1} e_{n-2} \cdots e_1)} \qquad (2)$$



We use the following rule for $\alpha_k$:

$$\frac{\alpha_k}{\alpha_{k-1}} = \alpha, \quad k = 2, 3, \ldots, n-1, \quad \alpha \text{ is a constant.} \qquad (3)$$

Rule (3) guarantees that no matter what $\alpha_1$ is, the result will not change as long as
$\alpha$ is the same. For simplicity, we set $\alpha_1 = 1$; then $\alpha_k = \alpha^{k-1}$.

In order to find the maximum value of $P(e_i \mid e_{n-1} e_{n-2} \cdots e_1)$, we only need to find
the maximum value of $Q(e_i \mid e_{n-1} e_{n-2} \cdots e_1)$. We mark it as $Q$ and calculate
it efficiently based on the WAP-tree [1].

Algorithm 1 (Predicting users’ future requests with WAP-tree)

Input: a WAP-tree [1] constructed from web logs, a user’s previous request sequence
$e_1, e_2, \ldots, e_{n-1}$, constants $M$ and $\alpha$, and number n.

Output: the n events of the user’s most probable requests at the next step.

Method:

(1) Initialize the optional event set $L = \emptyset$.

(2) Following $e_{n-1}$’s node-queue in the header table of the tree, for each node
(marked as $O$) in $e_{n-1}$’s node-queue:

(a) initialize $\lambda = 0$, $\beta = 1$, node $O' = O$, event $e' = e_{n-1}$. Mark the parent
node of node $O'$ as parent($O'$), and mark the event exactly before event $e'$
as parent($e'$).

(b) follow node $O$’s links in the tree from child to parent:
while (the label of node $O'$ is the same as event $e'$)
$\lambda \leftarrow \lambda + \beta$, $\beta \leftarrow \beta \cdot \alpha$
$O' \leftarrow$ parent($O'$), $e' \leftarrow$ parent($e'$)
end while

(c) for each child node of node $O$,
mark the count value of the child node as $f$, then

$$\Delta Q \leftarrow \frac{f}{f + M} \cdot \lambda$$

If the label of the child node does not yet exist in the optional event set $L$, create
a new item with the same label as the child, and set its
$Q \leftarrow \Delta Q$. Otherwise, update the $Q$ value of the item with the same label in the
set $L$: $Q \leftarrow Q + \Delta Q$.

(3) For all items in the optional set $L$, select the top n with the largest $Q$ values and
return their labels, which denote the events.

The algorithm shows that we only need to scan part of the tree once. In addition,
the tree does not need to be constructed again when making another prediction. Generally,
we can keep the tree in memory, where it can be updated easily: we just need to
insert the user’s new sequence into the tree according to the construction rule.
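The steps of Algorithm 1 can be illustrated with the following Python sketch. It is a minimal reading of the algorithm under stated assumptions, not the authors' implementation: the helper names `build_tree` and `predict`, the tuple-based ranking, and the toy sequences and parameter values (`M=1`, `alpha=2`) are chosen only for the example.

```python
from collections import defaultdict


class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent
        self.count, self.children = 0, {}


def build_tree(sequences):
    """Build an untruncated WAP-tree; return (root, header table)."""
    root, header = Node(None), defaultdict(list)
    for seq in sequences:
        node = root
        for event in seq:
            if event not in node.children:
                node.children[event] = Node(event, node)
                header[event].append(node.children[event])
            node = node.children[event]
            node.count += 1
    return root, header


def predict(header, history, M, alpha, top_n):
    """Rank candidate next events by Q, following steps (1)-(3) of Algorithm 1."""
    scores = defaultdict(float)                # step (1): L starts empty
    for start in header.get(history[-1], []):  # step (2): queue of the latest event
        lam, beta = 0.0, 1.0
        node, pos = start, len(history) - 1
        # step (b): climb towards the root while the path matches the history
        while pos >= 0 and node.label == history[pos]:
            lam += beta
            beta *= alpha
            node, pos = node.parent, pos - 1
        # step (c): every child of the starting node is a candidate next event
        for child in start.children.values():
            scores[child.label] += child.count / (child.count + M) * lam
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))  # step (3)
    return ranked[:top_n]


# Hypothetical training sequences and a two-request history.
root, header = build_tree(["abc", "abd", "abc", "bc"])
ranking = predict(header, "ab", M=1, alpha=2, top_n=2)
```

Only the nodes in one event-node queue, their ancestors, and their children are visited, which is the "scan part of the tree once" property noted above.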





Fig. 1. (a) A curve showing the change of precision according to different quantity of returning 
events, (b) Precision of top events ranked by Q values. Each histogram represents the precision 
percentage of the event with a certain rank 



4 Experimental Evaluation and Conclusions 

In the evaluation of the algorithm, we use the following measures. Let
$S = \{S_1, S_2, \ldots, S_m\}$ be the set of sequences in a log file. We build WAP-tree models
on a subset of these sequences, known as the training sequences. The remainder is
used as testing sequences. If the top n events returned by the algorithm contain the
actual next request, we say that it is a correct prediction. Let $P^+$ and $P^-$ be the sets of
correct and incorrect predictions, respectively, and $R$ be the set of all requests. For some
requests our algorithm cannot give a prediction (when the optional set $L$ ends up
empty), so in general $|P^+| + |P^-| \le |R|$. We use the following measures [6]:

$$\text{precision} = \frac{|P^+|}{|P^+| + |P^-|}, \qquad \text{applicability} = \frac{|P^+| + |P^-|}{|R|}$$

We use Microsoft anonymous web data as experimental data. The data records the
use by 38000 anonymous users. It can be downloaded from the website [7].
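The two measures defined above can be computed as in the following sketch. The function name `evaluate` and the convention that an empty list stands for "no prediction could be made" are assumptions introduced for the illustration.

```python
def evaluate(predictions, actual_next):
    """Return (precision, applicability).

    predictions: one list of top-n labels per test request
                 (an empty list means no prediction was made);
    actual_next: the request that actually followed in each case.
    """
    correct = incorrect = 0
    for predicted, actual in zip(predictions, actual_next):
        if not predicted:
            continue  # optional set L was empty: counts against applicability only
        if actual in predicted:
            correct += 1
        else:
            incorrect += 1
    made = correct + incorrect
    precision = correct / made if made else 0.0
    applicability = made / len(actual_next) if actual_next else 0.0
    return precision, applicability


# Hypothetical outcome of four test requests.
precision, applicability = evaluate(
    [["a", "b"], ["c"], [], ["d"]],
    ["a", "x", "y", "d"],
)
```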

As we use the original sequences to build our WAP-tree, almost all the testing
requests can be given predictions. So in our experiments, applicability = 1. If our
WAP-tree is built from frequent sequences, then applicability will decrease, but
precision will increase.

As applicability = 1, here we only need to analyze precision.

Let $M = 50$, $\alpha = 2$, and we change the parameter n, which denotes the number of
predicted events returned, from 1 to 10. The precision percentage is
shown in Figure 1(a).

Further, we obtain the predictive ability of the individual events, shown in Figure 1(b).
Figure 1(b) shows that, among the n returned events, the greater the Q value of an event,
the greater its precision is likely to be. This means that Q can approximately represent
the actual conditional probability.

As we can see, even the event with the greatest Q value does not have a very high precision
percentage. There are several reasons. The most important one is that different users have
different request sequences. We can reduce applicability to increase precision. As we
said previously, we can first mine the web logs to collect frequent sequences with a
certain minimum support and build the WAP-tree from the frequent sequences instead of
from the original sequences. This method not only increases precision but also decreases
the time consumed in prediction. One shortcoming of this method is that for some
previous sequences it cannot give predictions; that is, applicability will decrease to
less than 1. To keep applicability equal or approximately equal to 1 and still obtain high
precision, we can return the top n events instead of only the top one. In some cases, we need to
compromise between precision and applicability. This is very useful in recommendation
systems.


Abstract. Correlated pattern mining has become increasingly important
recently as an alternative or an augmentation of association rule
mining. Though correlated pattern mining discloses the correlation re- 
lationships among data objects and reduces significantly the number of 
patterns produced by the association mining, it still generates quite a 
large number of patterns. In this paper, we propose closed correlated 
pattern mining to reduce the number of correlated patterns produced
without information loss. We first propose a new notion of
confidence-closed correlated patterns, and then present an efficient
algorithm, called CCMine, for mining those patterns. Our performance
study shows that confidence-closed pattern mining reduces the number
of patterns by at least an order of magnitude. It also shows that CCMine
outperforms a simple method making use of the traditional closed
pattern miner. We conclude that confidence-closed pattern mining is a
valuable approach to condensing correlated patterns. 



1 Introduction 

Though association rule mining has been extensively studied in data mining 
research, its popular adoption and successful industry application has been hin- 
dered by a major obstacle: association mining often generates a huge number 
of rules, but a majority of them either are redundant or do not reflect the true 
correlation relationship among data objects. To overcome this difficulty, inter- 
esting pattern mining has become increasingly important recently and many 
alternative interestingness measures have been proposed [1,2, 3,4]. 

While there is still no universally accepted best measure for judging interesting
patterns, all_confidence [5] is emerging as a measure that can disclose true
correlation relationships among data objects [5, 6, 7, 8]. One of the important properties
of all_confidence is that it is not influenced by the co-absence of object pairs
in the transactions; this property is called null-invariance [8]. The
co-absence of a set of objects, which is normal in large databases, may have an unexpected
impact on the computation of many correlation measures. All_confidence
can disclose genuine correlation relationships without being influenced by object
co-absence in a database, while many other measures cannot. In addition,
all_confidence mining can be performed efficiently using its downward closure
property [5].

Although the all_confidence measure significantly reduces the number of patterns
mined, it still generates quite a large number of patterns, some of which are
redundant. This is because mining a long pattern may generate an exponential
number of sub-patterns due to the downward closure property of the measure.
For frequent itemset mining, several studies have proposed ways to reduce
the number of itemsets mined, including mining closed [9], maximal [10], and
compressed (approximate) [11] itemsets. Among them, closed itemset mining,
which mines only those frequent itemsets having no proper superset with the
same support, limits the number of patterns produced without information loss.
It has been shown in [12] that closed itemset mining generates a result set that is
orders of magnitude smaller than that of frequent itemset mining.

In this paper, we introduce the concept of the confidence-closed correlated pattern,
which plays the role of reducing the number of correlated patterns produced
without information loss. All_confidence is used as our correlation measure. However,
the result can be easily extended to several other correlation measures, such
as coherence [6]. First, we propose the notion of the confidence-closed correlated
pattern. Previous studies use the concept of the support-closed pattern, i.e., the
closed pattern based on the notion of support. However, support-closed pattern
mining fails to distinguish patterns with different confidence values.
In order to overcome this difficulty, we introduce the confidence-closed correlated
pattern, which encompasses both confidence and support. Then we propose an
efficient algorithm, called CCMine, for mining confidence-closed patterns. Our
experimental and performance study shows that confidence-closed pattern mining
reduces the number of patterns by at least an order of magnitude. It also shows
the superiority of the proposed algorithm over a simple method that mines the
confidence-closed patterns using the patterns generated by the support-closed
pattern miner.

The rest of the paper is organized as follows. Section 2 introduces the basic
concepts of frequent itemset mining and all_confidence. Section 3 defines the notion
of confidence-closed patterns and discusses its properties. Section 4 presents
an efficient algorithm for mining confidence-closed correlated patterns. Section 5
reports our experimental and performance results. Finally, Section 6 concludes
the paper.



2 Background 

We first introduce the basic concepts of frequent itemset mining, and then
present a brief review of the all_confidence measure.

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items, and DB be a database that consists
of a set of transactions. Each transaction $T$ consists of a set of items such that
$T \subseteq I$. Each transaction is associated with an identifier, called TID. Let $A$ be a
set of items, referred to as an itemset. An itemset that contains k items is a
k-itemset. A transaction $T$ is said to contain $A$ if and only if $A \subseteq T$. The support
of an itemset $X$ in DB, denoted as $sup(X)$, is the number of transactions in
DB containing $X$. An itemset $X$ is frequent if its support is no less than a
user-defined minimum support threshold. In most data mining studies, only the
frequent itemsets are considered significant and are mined.

The all_confidence of an itemset $X$ is the minimal confidence among the set
of association rules $i_j \rightarrow X - i_j$, where $i_j \in X$. Its formal definition is given as
follows. Here, the max_item_sup of an itemset $X$ means the maximum (single)
item support in DB of all the items in $X$.

Definition 1. (all_confidence of an itemset) Given an itemset
$X = \{i_1, i_2, \ldots, i_k\}$, the all_confidence of $X$ is defined as

$$\mathit{max\_item\_sup}(X) = \max\{\, sup(i_j) \mid \forall i_j \in X \,\} \qquad (1)$$

$$\mathit{all\_conf}(X) = \frac{sup(X)}{\mathit{max\_item\_sup}(X)} \qquad (2)$$

Given a transaction database DB, a minimum support threshold min_sup
and a minimum all_confidence threshold min_α, a frequent itemset X is
all_confident or correlated if all_conf(X) ≥ min_α and sup(X) ≥ min_sup.
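As a small illustration of Definition 1, the following Python sketch computes support and all_confidence by direct counting. The five-transaction database and the function names are hypothetical and serve only to make the arithmetic concrete.

```python
def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def all_conf(itemset, db):
    """sup(X) divided by the largest single-item support among the items of X."""
    max_item_sup = max(support(frozenset([i]), db) for i in itemset)
    return support(itemset, db) / max_item_sup


def is_correlated(itemset, db, min_sup, min_alpha):
    return support(itemset, db) >= min_sup and all_conf(itemset, db) >= min_alpha


# Hypothetical database: sup(a) = 4, sup(b) = 3, sup(c) = 3.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"),
      frozenset("a"), frozenset("bc")]

ab_conf = all_conf(frozenset("ab"), db)  # sup(ab) = 2, max_item_sup = 4
bc_conf = all_conf(frozenset("bc"), db)  # sup(bc) = 2, max_item_sup = 3
```

With min_sup = 2 and min_α = 0.6 in this toy database, {b, c} is correlated while {a, b} is not, although both have the same support.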



3 Confidence-Closed Correlated Patterns 

It is well known that closed pattern mining has served as an effective method 
to reduce the number of patterns produced without information loss in frequent 
itemset mining. Motivated by such practice, we extend the notion of closed 
pattern so that it can be used in the domain of correlated pattern mining. We 
present the formal definitions of the original and extended ones in Definitions 2
and 3, respectively. In this paper, we call the former support-closed and the latter
confidence-closed.

Definition 2. (Support-Closed Itemset) An itemset $Y$ is a support-closed
(correlated) itemset if it is frequent and correlated and there exists no proper
superset $Y' \supset Y$ such that $sup(Y') = sup(Y)$.

Since the support-closed itemset is based on support, it cannot retain the
confidence information (note that in this paper confidence means the value of
all_confidence). In other words, support-closed mining causes information loss.

Example 1. Let itemset ABCDE be a correlated pattern with support 30 and
confidence 30% and itemset CDE be one with support 30 and confidence 80%.
Suppose that we want to get a set of non-redundant correlated patterns when
min_sup = 20 and min_α = 20%. Support-closed pattern mining generates only
ABCDE, eliminating CDE since ABCDE is a superset of CDE with the
same support. We thus lose the pattern CDE. However, CDE might be more
interesting than ABCDE since the former has higher confidence than the latter.

W.-Y. Kim, Y.-K. Lee, and J. Han

We thus extend the support-closed itemset to encompass the confidence so
that it can retain the confidence information as well as the support information.

Definition 3. (Confidence-Closed Itemset) An itemset $Y$ is a confidence-closed
itemset if it is correlated and there exists no proper superset $Y' \supset Y$
such that $sup(Y') = sup(Y)$ and $\mathit{all\_conf}(Y') = \mathit{all\_conf}(Y)$.

By applying mining of confidence-closed itemsets to Example 1, we can obtain
not only itemset ABCDE but also CDE as confidence-closed itemsets, since they
have different confidence values, and therefore no information loss occurs. In the
rest of our paper, we call the support-closed pattern SCP and the
confidence-closed pattern CCP.
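The difference between Definitions 2 and 3 can be checked by brute force on a tiny database. The sketch below enumerates every itemset, so it is only suitable for toy inputs and is not one of the mining algorithms discussed in this paper; the four-transaction database, the function names, and the thresholds are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def support(itemset, db):
    return sum(1 for t in db if itemset <= t)


def all_conf(itemset, db):
    # Fraction keeps the comparison of confidence values exact.
    return Fraction(support(itemset, db),
                    max(support(frozenset([i]), db) for i in itemset))


def closed_patterns(db, min_sup, min_alpha, use_confidence):
    """Correlated itemsets with no proper superset of equal support
    (and, if use_confidence is set, equal all_confidence as well)."""
    items = sorted(set().union(*db))
    universe = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

    def correlated(x):
        return support(x, db) >= min_sup and all_conf(x, db) >= min_alpha

    def redundant(y):
        return any(
            y < z and support(z, db) == support(y, db)
            and (not use_confidence or all_conf(z, db) == all_conf(y, db))
            for z in universe)

    return [y for y in universe if correlated(y) and not redundant(y)]


# Hypothetical database: b always occurs together with a, but a also occurs alone.
db = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("a")]
min_alpha = Fraction(2, 5)
scp = closed_patterns(db, 2, min_alpha, use_confidence=False)
ccp = closed_patterns(db, 2, min_alpha, use_confidence=True)
```

Here {b} has the same support as {a, b} but a higher all_confidence (1 versus 1/2), so it is dropped from the support-closed result and kept in the confidence-closed one, mirroring Example 1.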

4 Mining Confidence-Closed Correlated Patterns 

In this section, we propose two algorithms for mining CCPs: CCFilter and CCMine.
CCFilter is a simple algorithm that makes use of an existing support-closed pattern
generator. CCFilter consists of the following two steps. First, get the complete
set of SCPs using a previously proposed algorithm [13]. Second, check whether each
itemset in the resulting set, and each of its possible subsets, is confidence-closed.
If its confidence satisfies min_α and it has no proper superset with the
same confidence, it is generated as a confidence-closed itemset. CCFilter is used
as a baseline algorithm for comparison in Section 5.

CCFilter has a shortcoming: it generates SCPs with confidence lower than
min_α during the mining process. At the end, these patterns are removed. In
order to solve this problem, CCMine integrates the two steps of CCFilter into one.
Since all_confidence has the downward closure property, we can push the
confidence condition down into the process of confidence-closed pattern mining.

CCMine adopts the pattern-growth methodology proposed in [14]. In
previous studies (e.g., CLOSET+ [13] and CHARM [15]) for mining SCPs, two search
space pruning techniques, item merging and sub-itemset merging, have been
mainly used. However, if we apply these techniques directly to confidence-closed
pattern mining, we cannot obtain a complete set of CCPs. This is because,
if there exists a pattern, these techniques remove all of its sub-patterns with
the same support without considering confidence. We modify these optimization
techniques so that they can be used in confidence-closed pattern mining.

Lemma 1. (confidence-closed item merging) Let X be a correlated itemset. If
every transaction containing itemset X also contains itemset Y but not any
proper superset of Y, and all_conf(XY) = all_conf(X), then XY forms a
confidence-closed itemset and there is no need to search any itemset containing
X but no Y.

Lemma 2. (confidence-closed sub-itemset pruning) Let X be a correlated itemset
currently under construction. If X is a proper subset of an already found
confidence-closed itemset Y and all_conf(X) = all_conf(Y), then X and all of
X’s descendants in the set enumeration tree cannot be confidence-closed itemsets
and thus can be pruned.

Lemma 1 means that we have to mine the X-conditional database and the
XY-conditional database separately if all_conf(X) ≠ all_conf(XY). However,
though all_conf(X) and all_conf(XY) are different, the X- and XY-conditional
databases are exactly the same if sup(X) = sup(XY). Using this property, we
can avoid the overhead of building conditional databases for the prefix itemsets
with the same support but different confidence. We maintain a list candidateList
of the items whose support equals the size of the X-conditional
database but which are not included in the item merging because of their confidence.
The list is constructed as follows. For the X-conditional database, let Y be the set
of items in f_list that appear in every transaction. Do the following:
for each item $Y_i$ in Y, if $sup(Y_i) \le \mathit{max\_item\_sup}(X)$, set $X = X \cup Y_i$; otherwise
insert $Y_i$ into candidateList. When we check whether an itemset Z containing
X ($Z \supset X$) is confidence-closed, we also check whether the itemset $Z \cup Y'$
($Y' = Y_i \ldots Y_k$, $Y_i \in$ candidateList) could be confidence-closed. Using this method, we
compute CCPs without generating the two conditional databases of X and of
XY when all_conf(X) > all_conf(XY) and sup(X) = sup(XY).

Algorithm 4 shows the CCMine algorithm, which is based on an extension of
CLOSET+ [13] and integrates the above discussions into CLOSET+. Among
the many studies on support-closed pattern mining, CLOSET+ is the fastest
algorithm for a wide range of applications.

CCMine uses another optimization technique to reduce the search space by
taking advantage of the property of the all_confidence measure. Lemma 3
describes the pruning rule.

Lemma 3. (counting space pruning rule) Let $\alpha = i_1 i_2 \ldots i_k$. In the
$\alpha$-conditional database, for item $x$ to be included in an all_confident pattern, the
support of $x$ should be no greater than $sup(\alpha) / \mathit{min\_\alpha}$.

Proof. In order for $\alpha x$ to be an all_confident pattern, $\mathit{max\_item\_sup}(\alpha x) \le
sup(\alpha x) / \mathit{min\_\alpha}$. Moreover, $sup(\alpha) \ge sup(\alpha x)$. Thus, $\mathit{max\_item\_sup}(\alpha x) \le
sup(\alpha) / \mathit{min\_\alpha}$. Since $sup(x) \le \mathit{max\_item\_sup}(\alpha x)$, the lemma follows.

With this pruning rule, we can reduce the set of items $I_\beta$ to be counted and,
thus, reduce the number of nodes visited when we traverse the FP-tree to count
each item in $I_\beta$.
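The pruning rule of Lemma 3 amounts to a single comparison per candidate item, as the following sketch shows. The function name `counting_space` is hypothetical; the supports and thresholds are the ones used for prefix j in the running example of this paper (sup(j) = 2, min_α = 40%).

```python
def counting_space(item_supports, prefix_support, min_alpha, candidates):
    """Keep only the candidate items whose global support does not exceed
    sup(prefix) / min_alpha; the others cannot extend the prefix (Lemma 3)."""
    bound = prefix_support / min_alpha
    return [x for x in candidates if item_supports[x] <= bound]


# Global supports of the items that co-occur with j in the running example.
item_supports = {"a": 9, "c": 6, "e": 6, "f": 4, "i": 3}
to_count = counting_space(item_supports, prefix_support=2, min_alpha=0.4,
                          candidates=["a", "c", "e", "f", "i"])
```

Items a, c and e are discarded before any counting in the projected database takes place, so only f and i remain to be counted.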

Example 2. Let us illustrate the confidence-closed mining process using an example.
Figure 1 shows our running example of the transaction database DB.
Let min_sup = 2 and min_α = 40%. Scan DB once. We find and sort the list of
frequent items in support-descending order. This leads to f_list = ⟨a:9, b:7, c:6,
e:6, g:5, f:4, d:3, i:3, k:3, j:2, h:1⟩. Figure 2 shows the global FP-tree. For lack
of space, we only show two representative cases: mining for prefix j:2 and eg:5.

Algorithm 4 CCMine: Mining confidence-closed correlated patterns
Input: a transaction database DB; a support threshold min_sup;
a minimum all_confidence threshold min_α
Output: The complete set of confidence-closed correlated patterns.

Method:

1. Let CCP be the set of confidence-closed patterns. Initialize CCP ← ∅.

2. Scan DB once to find frequent items and compute the frequent list f_list (= ⟨f_1, f_2, ...⟩).

3. Call CCMine(∅, DB, f_list, CCP, ∅).

Procedure CCMine(α, CDB, f_list, CCP, candidateList)

1: for each item Y in f_list such that it appears in every transaction of CDB, delete
Y from f_list and set α ← Y ∪ α if all_conf(Yα) ≥ all_conf(α); otherwise
insert Y into candidateList in support-increasing order; {confidence-closed item
merging}

2: call GenerateCCP(α, candidateList, CCP);

3: build the FP-tree for CDB using f_list, which excludes all the items Y handled in the previous
step;

4: for each a_i in f_list (in reverse descending support order) do
5: set β = α ∪ a_i;

6: call GenerateCCP(β, candidateList, CCP);

7: get a set I_β of items to be included in the β-projected database; {counting space
pruning rule}

8: for each item in I_β, compute its count in the β-projected database;

9: for each b_j in I_β do

10: if sup(βb_j) < min_sup, delete b_j from I_β; {pruning based on min_sup}

11: if all_conf(βb_j) < min_α, delete b_j from I_β; {pruning based on min_α}

12: end for

13: call FP-mine(β, CDB, f_list, CCP, candidateList);

14: delete the items that were inserted in step 1 from candidateList;

15: end for

Procedure GenerateCCP(α, candidateList, CCP)
for each k-itemset Y = Y_1 ... Y_k (Y_i ∈ candidateList) do

add α ∪ Y into CCP if all_conf(α ∪ Y) ≥ min_α and α ∪ Y is not a subset of an itemset X
(in CCP) with the same support and confidence; {confidence-closed sub-itemset
pruning}
end for



TID    items bought
10     a, b, c, d, e, g
20     a, b, c, e, g, k
30     a, b, c, e, g
40     a, b, d, f, h
50     a, e, f, i, j
60     a, b, e, g, k
70     a, i, k
80     a, c
90     b, c, f
100    a, d, e, g
110    c, f, j
120    b, i

Fig. 1. A transaction database DB.

Fig. 2. FP-tree for the transaction database DB.

1. After building the FP-tree we mine the confidence-closed patterns with prefix
j:2. Computing counts: We compute the counts for items a, c, e, f, and i
to be included in the j-projected database by traversing the FP-tree shown
in Fig. 2. First, we use Lemma 3 to reduce the items to be counted. The support
of item z (z ∈ {a, c, e, f, i}) should be less than or equal to sup(j)/min_α =
2/0.4 = 5. With this pruning, items a, c and e are eliminated. Now, we
compute the counts of items f and i and construct the j-projected database. They
are 2 and 1, respectively. Pruning: We conduct pruning based on min_sup
and min_α. Item i is pruned since its support is less than min_sup. Item
f is not pruned since its confidence (2/4) is not less than min_α. Since
f is the only item in the j-conditional database, we do not need to build the
corresponding FP-tree. And fj:2 is a CCP.

2. After building the conditional FP-tree for prefix g:5, we mine the g:5-conditional
FP-tree with f_list = ⟨a:5, e:5, b:4, c:3⟩. Confidence item merging: We
try confidence-closed item merging of a and e. We delete a and e from f_list.
Since all_conf(ag) < all_conf(g), we insert a into candidateList. Then, we
extend the prefix from g to eg by the confidence-closed item merging.
GenerateCCP: we generate eg:5 as a CCP. In addition, we also generate aeg:5, in
which item a comes from candidateList. Now, in f_list, only two items, b:4 and
c:3, are left. We mine the CCPs with prefix ceg:3. First, we generate ceg as a CCP.
However, we cannot generate aceg as a CCP since all_conf(aceg) < min_α.
Since item b is the only item in f_list, bceg is a CCP. Again, abceg cannot
be a CCP, since it also does not satisfy min_α. In the same way, we mine the
beg:4-conditional database and generate beg and abeg as CCPs. After returning
from mining the beg:4-conditional FP-tree, item a is removed from candidateList.

5 Experiments 

In this section, we report our experimental results on the performance of CCMine
in comparison with the CCFilter algorithm. The results show that CCMine always
outperforms CCFilter, especially at low min_sup. Experiments were performed
on a 2.2GHz Pentium IV PC with 512MB of memory, running Windows 2000.
Algorithms were coded with Visual C++.

Our experiments were performed on two real datasets, as shown in Table 1.
The pumsb dataset contains census data for population and housing and is
obtained from http://www.almaden.ibm.com/software/quest. Gazelle, a
transactional data set, comes from click-stream data from Gazelle.com. In the table,
ATL/MTL represent average/maximum transaction length. The gazelle dataset
is rather sparse in comparison with the pumsb dataset, which is so dense that
it produces many long frequent itemsets even for very high values of support.



Table 1. Characteristics of Real Datasets.

Dataset   #Tuples   #Items   ATL/MTL
gazelle   59602     497      2.5/267
pumsb     49046     2113     74/74



We first show that the complete set of CCPs is much smaller in comparison
with both that of correlated patterns and that of SCPs. Figure 3 shows the
number of CCPs, correlated patterns, and SCPs generated from the gazelle data
set. In this figure, the number of patterns is plotted on a log scale. Figure 3(a)
shows the number of patterns generated when min_sup varies and min_α is
fixed, while Figure 3(b) shows those when min_α varies and min_sup is fixed.
We first describe how much we can reduce the number of correlated patterns
with the notion of CCPs. Figures 3(a) and 3(b) show that CCP mining generates
a much smaller set than that of correlated patterns as the support threshold or
the confidence threshold decreases, respectively. This is desirable
since the number of correlated patterns increases dramatically as either of the
thresholds decreases. These figures also show that the number of SCPs is considerably
larger than that of CCPs over the entire range of the support and confidence
thresholds. These results indicate that CCP mining generates a considerably smaller
set of patterns even at a low minimum support threshold and a low minimum
confidence threshold.





(a) min_α = 25% (b) min_sup = 0.01%

Fig. 3. Number of patterns generated from the gazelle data set. 



Let us then compare the relative efficiency and effectiveness of the CCMine
and CCFilter methods. Figure 4(a) shows the execution time of the two methods
on the gazelle dataset using different minimum support thresholds while min_α is
fixed at 25%. Figure 4(a) shows that CCMine always outperforms CCFilter over
the entire range of supports in the experiments. When the support threshold is low, CCMine
is more than 100 times faster than CCFilter; e.g., with min_sup = 0.05%,
CCFilter takes 20 seconds to finish while CCMine takes only 0.2. The reason why
CCMine is superior to CCFilter is that CCFilter has to find all of the support-closed
patterns although many of them do not satisfy the minimum confidence
threshold, and the number of these patterns increases greatly as the minimum
support threshold decreases. Figure 4(b) shows the performance on the gazelle
dataset when min_sup is fixed at 0.01% and min_α varies. As shown in the
figure, CCMine always outperforms CCFilter, and the execution time of CCMine
increases very slowly as min_α decreases. The execution time of CCFilter hardly changes
as min_α varies, which means it does not take any advantage of min_α.
This is because it spends most of its processing time on mining SCPs.

Now, we conduct the experiments on the pumsb dataset, which is a dense
dataset. Figure 5(a) shows the execution time on the pumsb dataset when min_sup
varies while min_α is fixed at 60%. Figure 5(a) shows that the CCMine method
outperforms the CCFilter method when min_sup is less than 60%. When min_sup
becomes less than 50%, CCFilter runs out of memory and cannot finish. Figure
5(b) shows that the CCMine method always outperforms the CCFilter method over the entire
range of min_α.





(a) min_α = 25%. (b) min_sup = 0.01%.

Fig. 4. Execution time on gazelle data set 



In summary, experimental results show that the number of confidence-closed
correlated patterns is quite small in comparison with that of the support-closed
patterns. The CCMine method outperforms CCFilter, especially when the support
threshold is low or the confidence threshold is high.

(a) when min_α = 60%. (b) when min_sup = 50%.

Fig. 5. Execution time on the pumsb dataset.



6 Conclusions

In this paper, we have presented an approach that can effectively reduce the
number of correlated patterns to be mined without information loss. We proposed
a new notion of confidence-closed correlated patterns. Confidence-closed
correlated patterns are those that have no proper superset with the same support
and the same confidence. For efficient mining of those patterns, we presented the
CCMine algorithm. Several pruning methods have been developed that reduce
the search space. Our performance study shows that confidence-closed correlated
pattern mining reduces the number of patterns by at least an order of magnitude
in comparison with correlated (non-closed) pattern mining. It also shows that
CCMine outperforms CCFilter in terms of runtime and scalability. Overall, it
indicates that confidence-closed pattern mining is a valuable approach to condensing
correlated patterns.

As indicated in the previous studies of mining correlated patterns, such as
[6,5,8], all_confidence is one of several favorable correlation measures with the
null-invariance property. Based on our examination, CCMine can be easily extended
to mining with some other correlation measures, such as coherence or bond [6,5,8]. It is an
interesting research issue to systematically develop other mining methodologies,
such as constraint-based mining, approximate pattern mining, etc., under the
framework of mining confidence-closed correlated patterns.


Abstract. Measuring the similarity between objects described by categorical
attributes is a difficult task because no relations between categorical
values can be mathematically specified or easily established. In the
literature, most similarity (dissimilarity) measures for categorical data 
consider the similarity of value pairs by considering whether or not these 
two values are identical. In these methods, the similarity (dissimilarity) 
of a non-identical value pair is simply considered 0 (1). In this paper, 
we introduce a dissimilarity measure for categorical data by imposing 
association relations between non-identical value pairs of an attribute 
based on their relations with other attributes. The key idea is to mea- 
sure the similarity between two values of a categorical attribute by the 
similarities of the conditional probability distributions of other attributes 
conditioned on these two values. Experiments with a nearest neighbor 
algorithm demonstrate the merits of our proposal in real-life data sets. 

Keywords: Dissimilarity measures, categorical data, conditional prob- 
ability, hypothesis testing. 



1 Introduction 

Dissimilarity measures between data objects are crucial notions in data mining,
and measuring dissimilarities between data objects is important in detecting
patterns or regularities in databases. Such measures are indispensable for
distance-based data mining methods such as distance-based clustering [1,2,3],
nearest neighbor techniques [4,5,6,7], etc.

Measuring (dis)similarities between categorical data objects is a difficult task
since relations between categorical values cannot be mathematically specified or
easily established. This is often the case in many domains in which data are
described by a set of descriptive attributes, which are neither numerical nor
ordered in any way.

In real-life data sets, categorical attributes are considered as discrete random
variables and often depend on each other. For example, the independence confidence
level of the attributes changes in lym and block of affere in the data set
Lymphography is less than 1% when applying the χ² test on their contingency table
(Table 1).



H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 580—589, 2004. 

Table 1. Contingency table of the attributes changes in lym and block of affere
(left: counts; right: proportions within each value of changes in lym)

                  block of affere                       block of affere
changes in lym    no    yes   sum     changes in lym    no     yes    sum
bean               5     1     6      bean              .83    .17    1
oval              42    35    77      oval              .55    .45    1
round             19    46    65      round             .29    .71    1



Dependencies among attributes lead to relations between a value $v_i$ of a
categorical attribute $A^i$ and the conditional probability distribution (cpd) of other
attributes when the attribute $A^i$ takes the value $v_i$. Hence, relations between
values of a categorical attribute can be obtained from dissimilarities between
the cpds of other attributes corresponding to the values. For example, the cpds
of the attribute block of affere are (83%: yes, 17%: no), (55%: yes, 45%: no),
and (29%: yes, 71%: no) when the attribute changes in lym holds the values
bean, oval, and round, respectively. The cpd of block of affere when the attribute
changes in lym holds the value bean is more similar to its cpd when the attribute
changes in lym holds the value oval than when it holds the value round. Hence,
the dissimilarity between bean and oval should be smaller than the dissimilarity
between bean and round.
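
The conditional distributions quoted above can be recovered by normalizing each row of the counts in Table 1. The following is a minimal sketch; the names `counts` and `cpd` are illustrative and not taken from the paper.

```python
# Counts from Table 1: one row per value of the conditioning attribute,
# in the same column order as the table.
counts = {
    "bean": [5, 1],
    "oval": [42, 35],
    "round": [19, 46],
}


def cpd(row):
    """Normalize one row of counts into a conditional probability distribution."""
    total = sum(row)
    return [c / total for c in row]


cpds = {value: cpd(row) for value, row in counts.items()}
for value, dist in cpds.items():
    print(value, [round(p, 2) for p in dist])
```

Each row sums to 1, so the three distributions can be compared directly with any dissimilarity function on probability distributions.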

In this paper, we introduce a dissimilarity measure for categorical data by
imposing relations between a value $v_i$ of a categorical attribute $A^i$ and the cpds of
other attributes when the attribute $A^i$ takes the value $v_i$. The main idea here
is to define the dissimilarity between values $v_i$ and $v_i'$ of the attribute $A^i$ as the
total sum of dissimilarities between cpds of other attributes when $A^i = v_i$ and
$A^i = v_i'$. Then, the dissimilarity between two data objects is considered as the
total sum of dissimilarities of their attribute value pairs.

This paper is organized as follows: related works are briefly summarized in
Section 2. Conditional probability distribution-based dissimilarity (pd) and its
characteristics are presented in Section 3, while our experiments are described
in Section 4. Conclusions and recommendations for further work are discussed
in the last section.



2 Related Works 

A common way to measure dissimilarities between categorical data objects is to 
transform them into binary vectors. Then, dissimilarities between data objects 
are considered as dissimilarities between corresponding binary vectors. 

(Dis)similarity measures between binary vectors have been studied for many
years [8,9,10,11,12]. Most of these measures are based on the number of identical
and non-identical value pairs. The similarity (dissimilarity) of a non-identical
value pair is simply 0 (1) and the similarity (dissimilarity) of an identical value
pair is simply 1 (0).

Table 2. Some well-known similarity measures for binary vectors

Measure                             Definition                 Range     Family
Kendall, Sokal-Michener (1958)      (a+d) / (a+b+c+d)          [0,1]     S
Rogers and Tanimoto (1960)          (a+d) / (a+d+2(b+c))       [0,1]     S
Sokal and Sneath (1963)             (a+d) / (b+c)              [0,∞]     S
Hamman (1961)                       (a+d-b-c) / (a+b+c+d)      [-1,1]    T
Jaccard (1900)                      a / (a+b+c)                [0,1]     T
Dice (1945), Czekanowski (1913)     a / (a+(1/2)(b+c))         [0,1]     T
Sokal and Sneath                    a / (a+2(b+c))             [0,1]     T



Let $\Pi = \{\pi_1, \ldots, \pi_l\}$ be the value set of all attributes. A data object $x$ is
transformed into a binary vector $X = (x_1, \ldots, x_l)$, $x_i \in \{0, 1\}$, where $x_i = 1$ if
the object $x$ holds the value $\pi_i$, and $x_i = 0$ otherwise.

Denote $X$ and $Y$ as two binary vectors, $XY = \sum_t x_t y_t$, $\bar{X}$ as the
complementary vector of $X$: $\bar{X} = 1 - X = [1 - x_i]$, and the following counters:
$a = XY$, $b = X\bar{Y}$, $c = \bar{X}Y$ and $d = \bar{X}\bar{Y}$. Obviously, $a + b + c + d = l$ and
$a + b + c = m$ where $m$ is the number of attributes. Using these notions, several
measures on binary vectors have been defined. A brief summary is given in
Table 2.

Most of the popular practical similarity measures in the literature (Jaccard,
Dice, etc.) belong to one of these two families

$$S_\theta = \frac{a + d}{a + d + \theta(b + c)} \quad \text{and} \quad T_\theta = \frac{a}{a + \theta(b + c)}$$

as introduced by J.C. Gower and P. Legendre (1986) [12].
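
As an illustration of the two families, the sketch below counts a, b, c and d for a pair of 0/1 vectors and evaluates $S_\theta$ and $T_\theta$. The function names and the example vectors are hypothetical, and the code assumes the denominators are non-zero.

```python
def counters(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def s_theta(x, y, theta):
    """Family S: joint absences (d) count as agreement."""
    a, b, c, d = counters(x, y)
    return (a + d) / (a + d + theta * (b + c))


def t_theta(x, y, theta):
    """Family T: joint absences (d) are ignored."""
    a, b, c, d = counters(x, y)
    return a / (a + theta * (b + c))


x = [1, 0, 0, 1, 0, 1]
y = [1, 0, 0, 0, 1, 1]
print(counters(x, y))      # a, b, c, d
print(s_theta(x, y, 1))    # Sokal-Michener (theta = 1)
print(s_theta(x, y, 2))    # Rogers and Tanimoto (theta = 2)
print(t_theta(x, y, 1))    # Jaccard (theta = 1)
print(t_theta(x, y, 0.5))  # Dice (theta = 1/2)
```

In both families every mismatch contributes the same amount whichever values are involved, which is the limitation that pd is meant to address.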



3 Conditional Probability Distribution-Based 
Dissimilarity 



3.1 Notations 



Let $A^1, \ldots, A^m$ be $m$ categorical attributes and $D \subseteq A^1 \times \ldots \times A^m$ a data set.
Let us denote

— $dom(A^i)$: the domain of the attribute $A^i$.

— $x = (x_1, \ldots, x_m)$, $x_i \in dom(A^i)$: a data object $x \in D$.

— $p(A^i = v_i)$: the probability that the attribute $A^i$ takes the value $v_i$,

$$p(A^i = v_i) = \frac{|\{x : x_i = v_i\}|}{|D|}$$

— $p(A^j = v_j \mid A^i = v_i)$: the conditional probability of the attribute $A^j$ taking
the value $v_j$ given that the attribute $A^i$ holds the value $v_i$,

$$p(A^j = v_j \mid A^i = v_i) = \frac{p(A^j = v_j, A^i = v_i)}{p(A^i = v_i)}$$

— $cpd(A^j \mid A^i = v_i)$: the conditional probability distribution of the attribute
$A^j$ given that the attribute $A^i$ holds the value $v_i$,

$$cpd(A^j \mid A^i = v_i) = \{p(A^j = v_j \mid A^i = v_i) : v_j \in dom(A^j)\}$$

— $\phi(\cdot,\cdot)$: a dissimilarity function of two probability distributions.



3.2 Conditional Probability Distribution-Based Dissimilarity 

Given a value pair $(v_i, v_i')$ of the attribute $A^i$, the conditional probability
distribution-based dissimilarity (pd) between $v_i$ and $v_i'$ is defined as the
total sum of dissimilarities between the cpd pairs of other attributes, given that
the attribute $A^i$ holds the values $v_i$ and $v_i'$:

$$pd(v_i, v_i') = \sum_{j,\, j \neq i} \phi\big(cpd(A^j \mid A^i = v_i),\; cpd(A^j \mid A^i = v_i')\big) \qquad (1)$$

The conditional probability distribution-based dissimilarity (pd) from the
data object $x$ to the data object $y$ is defined as the total sum of pds between
their attribute value pairs:

$$pd(x, y) = \sum_{i} pd(x_i, y_i) \qquad (2)$$

For example, consider the instance in Table 1 and the Kullback-Leibler divergence
[13,14] as the dissimilarity function $\phi(\cdot,\cdot)$:

$$\phi(P, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Taking block of affere as the only other attribute, the pds from bean to oval
and from bean to round are computed as:

$$pd(bean, oval) = 0.83 \log\frac{0.83}{0.55} + 0.17 \log\frac{0.17}{0.45} = 0.08$$

$$pd(bean, round) = 0.83 \log\frac{0.83}{0.29} + 0.17 \log\frac{0.17}{0.71} = 0.27$$
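
The two values in this example can be checked numerically. The sketch below assumes base-10 logarithms, which is the base that reproduces the reported 0.08 and 0.27; the paper does not state the base, and the helper name `kl_divergence` is illustrative.

```python
import math


def kl_divergence(p, q, base=10):
    """Kullback-Leibler divergence of q from p for two discrete distributions."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


# Rounded conditional distributions of block of affere taken from Table 1.
bean = [0.83, 0.17]
oval = [0.55, 0.45]
round_dist = [0.29, 0.71]

pd_bean_oval = kl_divergence(bean, oval)
pd_bean_round = kl_divergence(bean, round_dist)
print(round(pd_bean_oval, 2), round(pd_bean_round, 2))

# The measure is not symmetric: swapping the arguments changes the value.
print(round(kl_divergence(oval, bean), 2))
```

Swapping the arguments gives a different result, which is why the text speaks of the dissimilarity "from" one value "to" another.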



3.3 Characteristics of pd 

In this subsection, we discuss a property and two important notes of pd in 
practice. 



Proposition 1. If all attribute pairs are independent, then conditional prob- 
ability distribution-based dissimilarities between data object pairs are all equal 
to 0. 

Proof. Since all attribute pairs are independent, we know that:

$$\forall A^i, A^j,\; v_i, v_i' \in dom(A^i),\; v_j \in dom(A^j):$$

$$p(A^j = v_j \mid A^i = v_i) = p(A^j = v_j) = p(A^j = v_j \mid A^i = v_i')$$

$$\Rightarrow cpd(A^j \mid A^i = v_i) = cpd(A^j \mid A^i = v_i')$$

$$\Rightarrow \phi\big(cpd(A^j \mid A^i = v_i),\; cpd(A^j \mid A^i = v_i')\big) = 0$$

$$\Rightarrow pd(v_i, v_i') = 0$$

$$\Rightarrow pd(x, y) = 0 \quad \forall x, y \in D.$$



The following two notes regarding pd are important in practice.

1. pd follows the properties of the $\phi(\cdot,\cdot)$ function.

2. Since the number of attribute values of a data set is usually small, it is
possible to store temporary data such as $cpd(A^j \mid A^i = v_i)$, $p(A^i = v_i)$, etc., in
memory after one scan of the whole data set. Hence, the computing time
for a value pair of an attribute depends mainly on the number of attribute
values and is linear in the data set size.
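
The second note can be sketched as follows: one pass over the data collects co-occurrence counts, after which every cpd and pd is computed from the stored counts alone. This is a minimal illustration and not the authors' implementation. The toy data, the function names and the small additive smoothing constant `eps` (used so that the KL divergence stays finite when a value never co-occurs) are all assumptions.

```python
import math
from collections import Counter, defaultdict


def build_counts(data):
    """Single scan: pair_counts[(i, vi, j)][vj] = co-occurrence count."""
    pair_counts = defaultdict(Counter)
    for row in data:
        for i, vi in enumerate(row):
            for j, vj in enumerate(row):
                if i != j:
                    pair_counts[(i, vi, j)][vj] += 1
    return pair_counts


def cpd(pair_counts, domains, i, vi, j, eps=1e-6):
    """Smoothed cpd(A^j | A^i = vi) as a dict over dom(A^j)."""
    counts = pair_counts[(i, vi, j)]
    total = sum(counts.values()) + eps * len(domains[j])
    return {vj: (counts[vj] + eps) / total for vj in domains[j]}


def kl(p, q):
    """KL divergence between two dicts defined on the same keys."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def pd_values(pair_counts, domains, i, vi, vi2):
    """Equation (1): sum over all other attributes j != i."""
    return sum(
        kl(cpd(pair_counts, domains, i, vi, j), cpd(pair_counts, domains, i, vi2, j))
        for j in range(len(domains))
        if j != i
    )


def pd_objects(pair_counts, domains, x, y):
    """Equation (2): sum over the attribute value pairs of two objects."""
    return sum(
        pd_values(pair_counts, domains, i, xi, yi)
        for i, (xi, yi) in enumerate(zip(x, y))
    )


# Hypothetical toy data set with two attributes.
data = [
    ("bean", "no"), ("bean", "no"), ("oval", "no"),
    ("oval", "yes"), ("round", "yes"), ("round", "yes"),
]
domains = [sorted({row[i] for row in data}) for i in range(len(data[0]))]
pair_counts = build_counts(data)

print(pd_values(pair_counts, domains, 0, "bean", "oval"))
print(pd_values(pair_counts, domains, 0, "bean", "round"))
print(pd_objects(pair_counts, domains, ("bean", "no"), ("oval", "yes")))
```

After the scan, the cost of a value-pair dissimilarity depends only on the domain sizes, not on the number of objects.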

4 Experiments 

In this section, we conduct two experiments on pd. The first is to analyze pd on
two data sets, Vote and Monks, from UCI [15]. The second experiment is to
employ pd in a nearest neighbor classifier (NN) [4], where dissimilarities between
data objects play a crucial role, in order to show its usefulness in practice.
To evaluate the performance of the proposed dissimilarity measure, we compare the
accuracies of NN using pd with those of NN with Jaccard, employing a 10-trial
10-fold cross-validation strategy. In our experiments, we employ the
Kullback-Leibler divergence as the distribution dissimilarity function $\phi(\cdot,\cdot)$ in Equation 1.



4.1 Analysis of pd on the UCI Data Sets 
pd on the Data Set Vote 

Table 3 presents dissimilarities between value pairs of the data set Vote. As can 
be seen from this table, 

— Dissimilarities between value pairs are nonnegative, zero on diagonals, and 
asymmetrical. These properties are attributed to the nature of KL diver- 
gence. 

— Dissimilarities between value pairs of an attribute are not simply 0 or 1. For
example, on the attribute immigration, the dissimilarity between yes and
nc is 16.26, the dissimilarity between yes and no is 0.20, the dissimilarity
between nc and no is 15.11, etc.

Table 3. Dissimilarities between attribute values of Vote

       handicapped           water-project         budget-resolution     physician-fee
       nc     no     yes     nc     no     yes     nc     no     yes     nc     no     yes
nc     0      9.77   7.89    0      2.34   1.86    0      13.22  10.20   0      10.99  17.21
no     7.68   0      3.68    1.27   0      0.71    7.77   0      10.17   9.50   0      13.38
yes    4.63   3.47   0       1.32   0.75   0       7.42   10.83  0       16.08  11.54  0

       mx-missile            immigration           synfuels cutback      education-spending
       nc     no     yes     nc     no     yes     nc     no     yes     nc     no     yes
nc     0      8.25   2.71    0      15.11  17.67   0      6.69   5.78    0      4.37   7.32
no     8.03   0      11.42   16.15  0      0.21    3.68   0      0.88    3.06   0      10.60
yes    1.97   10.60  0       16.26  0.20   0       5.29   9.79   0       4.45   8.98   0

Fig. 1. Some conditional probability distributions when given the attribute
immigration values nc, yes, and no



— Dissimilarities between value pairs of an attribute are different from pair 
to pair. The differences are due to various differences between the cpds 
of other attributes corresponding to values of this attribute. For example, 
(on the attribute immigration) cpds of other attributes when the attribute 
immigration takes the value of yes are similar to cpds when it takes the 
value of no and also different from those when it takes the value of nc (some 
examples are given in Fig. 1). Hence, the dissimilarity between the values 
yes and no, 0.20, is much smaller than the dissimilarity between the values 
yes and nc, 16.26, or the dissimilarity between the values no and nc, 16.15. 



pd on the Data Set Monks 

Table 4 shows dissimilarities between value pairs of the data set Monks. As shown
in this table, dissimilarities between value pairs of an attribute are all zero. Since
all attribute pairs are independent (the independence test used for this data set is
described in Subsection 4.2), the above observation can be explained by Proposition 1.

Table 4. Dissimilarities between attribute values of Monks

   A1           A2           A3        A4           A5              A6
   1  2  3      1  2  3      1  2      1  2  3      1  2  3  4      1  2
1  0  0  0   1  0  0  0   1  0  0   1  0  0  0   1  0  0  0  0   1  0  0
2  0  0  0   2  0  0  0   2  0  0   2  0  0  0   2  0  0  0  0   2  0  0
3  0  0  0   3  0  0  0             3  0  0  0   3  0  0  0  0
                                                 4  0  0  0  0

4.2 Nearest Neighbor with pd 

To analyze how well pd boosts the performance of the nearest neighbor algorithm
[4] compared with Jaccard, we use the following hypothesis test:

$$H_0 : \mu_1 = \mu_2$$

$$H_1 : \mu_1 > \mu_2$$

where $\mu_1$ and $\mu_2$ are the average accuracies of NN with pd and NN with Jaccard,
respectively.

Since each 10-trial 10-fold cross-validation result contains 100 trials, the
difference between the two means $\mu_1$ and $\mu_2$ follows the normal distribution:

$$Z = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{\delta_1^2}{100} + \dfrac{\delta_2^2}{100}}}$$

where $\delta_1$ and $\delta_2$ are the deviations of the accuracy test results of NN with pd
and Jaccard.

Since pd relies strongly on dependencies between attributes, we analyze
dependencies of attribute pairs by the $\chi^2$ test at the 95% significance level:

$$\chi^2(A^i, A^j) = \sum_{v_i \in dom(A^i)} \sum_{v_j \in dom(A^j)} \frac{\big[\,|D|\, p(A^i = v_i, A^j = v_j) - |D|\, p(A^i = v_i)\, p(A^j = v_j)\big]^2}{|D|\, p(A^i = v_i)\, p(A^j = v_j)}$$

with the degree of freedom: $|dom(A^i)|\,|dom(A^j)| - 1$.

The dependent factor of $D$, denoted $p(D)$, is defined as the proportion between
the number of dependent attribute pairs and the total number of attribute
pairs:

$$p(D) = \frac{|\{(A^i, A^j) : A^i \text{ and } A^j \text{ are dependent}\}|}{m(m - 1)}$$
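
The Z statistic and the associated confidence level can be evaluated with the standard library alone. The sketch below is illustrative: the function names are assumptions, accuracies are given as fractions, and the inputs are the splice and monks rows of Table 5.

```python
import math


def z_statistic(mean1, dev1, mean2, dev2, n=100):
    """Z value for the difference of two means, each estimated from n trials."""
    return (mean1 - mean2) / math.sqrt(dev1 ** 2 / n + dev2 ** 2 / n)


def norm_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# splice: NN with pd 87.28% (deviation 0.0167), NN with Jaccard 75.43% (0.0260).
z_splice = z_statistic(0.8728, 0.0167, 0.7543, 0.0260)
# monks: NN with pd 49.56% (deviation 0.0772), NN with Jaccard 76.86% (0.0576).
z_monks = z_statistic(0.4956, 0.0772, 0.7686, 0.0576)

print(round(z_splice, 2), round(z_monks, 2))
print(round(norm_cdf(2.49), 3), round(norm_cdf(1.80), 3))
```

Apart from rounding in the reported means and deviations, the computed values agree with the Z column of Table 5 for these two rows.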



4.3 Results 

We employ the nearest neighbor algorithm (NN) with pd and with Jaccard on 
15 data sets from UCI [15]. Discretization for numeric attributes of the data sets 
is done automatically by the data mining system CBA [16]. 

Table 5. The accuracies of the nearest neighbor algorithm with pd and with Jaccard
similarity

                                   pd                  Jaccard
No.  Name             p(%)    μ1(%)    δ1        μ2(%)    δ2        Z
 1   splice           100     87.28    0.0167    75.43    0.0260     38.30
 2   tictactoe        100     96.61    0.0185    81.01    0.0382     36.76
 3   waveform          81     77.11    0.0148    71.58    0.0196     22.56
 4   crx               99     83.04    0.0401    78.38    0.0441      7.81
 5   anneal            63     99.60    0.0081    98.69    0.0127      6.05
 6   pima              53     71.62    0.0472    67.33    0.0566      5.82
 7   wine             100     99.83    0.0095    98.15    0.0276      5.76
 8   cmc               97     45.90    0.0389    43.59    0.0383      4.24
 9   hypo              98     98.97    0.0055    98.61    0.0071      4.09
10   vote             100     94.03    0.0360    92.14    0.0425      3.39
11   labor             99     97.53    0.0719    94.47    0.0880      2.70
12   zoo               98     98.02    0.0399    96.43    0.0500      2.49
13   post-operative   100     60.22    0.1695    55.78    0.1787      1.80
14   sonar             11     79.38    0.0856    75.56    0.0884     -3.10
15   monks              0     49.56    0.0772    76.86    0.0576    -28.34



Table 5 shows the results of NN with pd and NN with Jaccard for the 15 
data sets. The first column presents the indices of the data sets in the decreas- 
ing order of Z values. The second and third columns show the names and the 
dependent factors of the data sets. The fourth and fifth columns present accu- 
racies and deviations of NN with pd. The next two columns present accuracies 
and deviations of NN with Jaccard. The last column shows Z values. 

As can be seen from Table 5:

— The accuracies of NN with pd are higher than the accuracies of NN with
Jaccard on the first 13 data sets. The confidence levels are more than 99%
(NormDist($Z_\alpha$ = 2.49) = 0.994) for the first 12 data sets and more than 96%
(NormDist($Z_\alpha$ = 1.80) = 0.964) for the 13th.

— The accuracies of NN with pd are lower than the accuracies of NN with
Jaccard with a more than 99% confidence level (NormDist($Z_\alpha$ = −3.1) <
0.01) for the last two data sets.

From Table 5, it is noteworthy that:

— The first 13 data sets, for which the accuracies of NN with pd are higher
than with Jaccard, also have high dependent factors (e.g., vote 100%, splice
100%, waveform 81%).

— The last two data sets, for which the average accuracies of NN with pd are
lower than with Jaccard, have low dependent factors (sonar 11%, monks
0%).

These observations indicate that, in practice, pd is suitable for data sets with
dependent attributes, i.e., those with large dependent factors, and unsuitable for
data sets with independent attributes, i.e., those with small dependent factors.
This is because relations between the value pairs of an attribute are obtained from
the dependencies of other attributes when given the values. In other words, the
relations are based upon the dependencies of other attributes on this attribute.
Hence, when there are weak dependencies or no dependencies between attributes,
the obtained relations are very poor and lead to poor
performance of pd in practice.

5 Conclusion 

In this paper we introduce a new dissimilarity measure for categorical data based
on dependencies between attributes. The main idea here is to define the
dissimilarity between values ($v_i$ and $v_i'$) of the attribute $A^i$ as the total sum of
dissimilarities between cpds of other attributes when the attribute $A^i$ holds the values
$v_i$ and $v_i'$. The dissimilarity between two data objects is considered as the total
sum of dissimilarities of their attribute value pairs. Since pd strongly relies on
dependencies between attributes, independent attribute pairs can be considered
as noise in pd. Thus, in future studies, we will try to limit the effects of
independent attribute pairs as much as possible. Moreover, pd does not directly
reveal (dis)similarities between value pairs but rather relations between
partitions (clusters) of data generated by the value pairs. This led to our approach
of using this idea for clustering categorical data.

Abstract. To locate information embedded in documents, information
extraction systems based on rule-based pattern matching have long been
used. To further improve the extraction generalization, the hidden Markov
model (HMM) has recently been adopted for modeling temporal
variations of the target patterns with promising results. In this paper, a
state-merging method is adopted for learning the topology with the use
of a localized Kullback-Leibler (KL) divergence. The proposed system
has been applied to a set of domain-specific job advertisements and
preliminary experiments show promising results.



1 Introduction 

Information extraction (IE), which originated in the natural language processing
community, has a history of more than a decade, with the goal of extracting
target information (normally in the form of short text segments) from a large
collection of documents. Many related pattern matching systems have been built
[1] and the extraction rules can be induced automatically [2]. The pattern
matching approach only works well for extracting information with some distinctive
attributes or a limited vocabulary size. To further improve the extraction
generalization, the hidden Markov model (HMM) has recently been adopted for modeling
and extracting the temporal variations of the target patterns with promising
results.

The hidden Markov model (HMM) is a prevalent method for modeling stochastic
sequences with an underlying finite-state structure. To apply an HMM to
information extraction, the model topology must be determined before the
corresponding model parameters can be estimated. In this paper, we adopt a
state-merging method to learn the HMM topology, with the learning process
guided by the Kullback-Leibler (KL) divergence. As learning the optimal
structure is known to be computationally expensive, the use of a local KL divergence
as well as some domain-specific constraints has been explored. The proposed
system has been applied to extracting target items from a set of domain-specific
job advertisements and preliminary experiments show promising results
regarding its generalization performance.

* This research work is jointly supported by HKBU FRG/00-01/H-8 and HKBU
FRG/02-03/H-69.

H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 590–594, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004

<city>Austin</city> company is seeking <area>multimedia</area>
and <area>Internet</area> programmers for exciting web site and 
<area>multimedia</area> software production! Successful candidates 
will have <req_degree>B . S . </req_degree> degree, preferably in Computer 
Science, and <req_years_experience>2 </req_years_experience> years of 
professional experience. 



Fig. 1. A tagged job ad. 



2 HMM for Information Extraction 

The HMM has recently been applied to information extraction and found to work
well in a number of application domains, including the extraction of gene names
and locations from scientific abstracts [3] and the extraction of relevant
information from research papers [4]. The HMM we are studying is a discrete first-order one
with four types of states, namely background states (BG), prefix states (PR),
target states (TG) and suffix states (SU). Only the target states are intended
to produce the tokens we want to extract. Prefix and suffix states are added
because the context immediately before and after the targets may help identify
the target items. Background states are for producing the remaining tokens in
the document. Figure 1 shows a typical training example in the context of
job advertisements. Based on this formulation, we can hand-draft an HMM with
the four types of states as mentioned above and learn its parameters based on
the tagged information.

Automatically constructing an optimal model topology is known to be
challenging. In this paper, a bottom-up approach is adopted (see [5,6] for related
works). It starts with a complex HMM as a disjunction of all the training
examples (see Figure 2) and induces a model structure with better generalization by
merging states, where some measure to determine whether to merge two states
is typically required.
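
To make the four state types concrete, the sketch below turns a tagged ad in the style of Figure 1 into a sequence of (token, state type) pairs, from which an initial chain of states for one training example could be built. It is a hypothetical illustration, not the authors' preprocessing: it assumes whitespace tokenization and a context window of exactly one token for prefix and suffix states.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")


def label_tokens(tagged_text):
    """Return a list of (token, state_type) pairs for one tagged example."""
    tokens, inside = [], None
    # Split the text into tags and whitespace-separated words.
    for piece in re.findall(r"</?\w+>|[^\s<]+", tagged_text):
        match = TAG.fullmatch(piece)
        if match:
            # An opening tag starts a target field; a closing tag ends it.
            inside = None if match.group(1) else match.group(2)
        else:
            tokens.append([piece, "TG" if inside else "BG"])
    # Relabel the background token right before / after each target token.
    for k, (_, state) in enumerate(tokens):
        if state != "TG":
            continue
        if k > 0 and tokens[k - 1][1] == "BG":
            tokens[k - 1][1] = "PR"
        if k + 1 < len(tokens) and tokens[k + 1][1] == "BG":
            tokens[k + 1][1] = "SU"
    return [tuple(t) for t in tokens]


ad = ("<city>Austin</city> company is seeking <area>multimedia</area> "
      "and <area>Internet</area> programmers")
labels = label_tokens(ad)
for token, state in labels:
    print(state, token)
```

A token that sits between two targets is labeled once, as a suffix; other tie-breaking rules are equally plausible.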







Fig. 2. An initial model constructed by two training examples (S - start state, F - 
final state, TG - target state, BG - background state, PR - prefix state and SU - suffix 
state). 

3 KL Divergence Based State Merging

3.1 KL Divergence between HMMs

To determine whether two states should be merged, a distance measure between
the models before and after merging is needed. The Kullback-Leibler (KL) divergence
$D(A\|B)$ (also called relative entropy $H_{A,B}(X)$) can be used. The measure has
the property that $D(A\|B) = 0$ if and only if the models A and B are identical.
It is defined as

$$D(A\|B) = H_{A,B}(X) = -\sum_{i} p_A(x_i) \log \frac{p_B(x_i)}{p_A(x_i)} \qquad (1)$$

To compute the KL divergence between two HMMs, one can generate a long
data sequence using each of the models and compute the divergence empirically
according to Eq. (1). This empirical approach is known to be time-consuming
and yet not accurate. An alternative method is to compute the KL
divergence between two models directly from the model parameters. Carrasco et al.
[7] have derived an iterative solution for computing the KL divergence between
two probabilistic deterministic finite state automata (PDFA), given as

$$D(A\|B) = \sum_{q_i \in Q_A} \sum_{q_j \in Q_B} \sum_{\sigma} c_{ij}\, p_A(q_i, \sigma) \log \frac{p_A(q_i, \sigma)}{p_B(q_j, \sigma)} \qquad (2)$$

where $Q_A$ and $Q_B$ are the sets of states of PDFA A and PDFA B respectively. The
coefficients $c_{ij}$ are evaluated iteratively with guaranteed convergence. A related
KL divergence evaluation method is used by Thollard et al. [8] in the MDI algorithm to infer the
structure of PDFA.

Inspired by the MDI algorithm, we have attempted to apply a similar
approach for inducing an HMM's topology. However, it turns out that a similar
iterative solution does not exist for HMMs. The main reason why Carrasco's solution
[7] works for PDFA but is not applicable to HMMs is that every PDFA
generates a stochastic deterministic regular language (SDRL). An HMM, however, can
generate an observed sequence via multiple possible paths of state transitions,
which makes it non-deterministic. So, computing the KL divergence between
arbitrary HMMs efficiently is still an open research issue. In the
following subsection, we propose a learning algorithm that compares the entropies
of HMMs in an indirect manner.



3.2 The Proposed Learning Algorithm 

To compute the KL divergence between two HMMs, we first convert them to
probabilistic deterministic finite state automata (PDFA) using the well-known
subset construction algorithm. Then, the KL divergence between the two
converted PDFAs can be computed (using Carrasco's solution [7]) as an
approximation of the KL divergence between the two HMMs. In addition, based on the
observation that the changes in the structure of consecutive model candidates
during the state merging process are normally localized, we suggest further
restricting the KL divergence comparison to only the local structure of the
models (see Figure 3). This arrangement yields a further speed-up, at the
cost of some accuracy.

The search strategy in the topology space greatly affects the
computational complexity of the corresponding algorithm. In our proposed algorithm,
instead of randomly picking two states to merge as practiced by Stolcke in [5],
we adopt a search strategy similar to that of MDI [8], which is more efficient
via the use of heuristics. Furthermore, an orthogonal way to address the
search complexity is to add domain-specific constraints to narrow down the
search space. For information extraction, the states of the HMM correspond
to the different target information. Following [6], a constraint that allows only
states with the same label to be merged can be incorporated. Lastly, as the
actual emission vocabulary is much larger than that of the training set, absolute
discounting smoothing is adopted to alleviate the problem.
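
Absolute discounting can be sketched as follows for the emission distribution of a single state. This is one common formulation, not necessarily the exact variant used in the proposed system: a fixed discount is subtracted from every observed count and the freed probability mass is shared equally among unseen vocabulary words. The function name, the discount of 0.5 and the toy counts are assumptions.

```python
from collections import Counter


def absolute_discounting(counts, vocabulary, discount=0.5):
    """Smoothed emission probabilities for one state (0 < discount < 1)."""
    total = sum(counts[w] for w in vocabulary)
    seen = [w for w in vocabulary if counts[w] > 0]
    unseen = [w for w in vocabulary if counts[w] == 0]
    if not unseen:
        # Nothing to redistribute to: fall back to relative frequencies.
        return {w: counts[w] / total for w in vocabulary}
    reserved = discount * len(seen) / total
    probs = {w: (counts[w] - discount) / total for w in seen}
    probs.update({w: reserved / len(unseen) for w in unseen})
    return probs


# Hypothetical emission counts of one target state.
counts = Counter({"java": 3, "c++": 1})
vocabulary = ["java", "c++", "perl", "python"]
probs = absolute_discounting(counts, vocabulary)
print(probs)
```

Words never emitted by the state in training receive a small non-zero probability, so an unseen token at test time does not force the probability of a whole path to zero.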




Fig. 3. Two new HMMs constructed by merging two candidate states. 



4 Experimental Results 

In order to evaluate the proposed topology learning algorithm, we have tested it
using a data set containing 100 computer-related job ad postings. Each of them
contains more than 200 words. Ten-fold cross-validation is adopted to reduce
the sampling bias. According to Table 1, the proposed HMM topology learning
system outperforms the well-known RAPIER system [2] as well as a baseline
HMM model which merges all states of the same type in the initial model.
Also, the results show that even a simple baseline HMM model can perform
much better than the rule-based system. This demonstrates the advantage of
the high representation power of the HMM. The results further show the benefit
of using the HMM topology learning algorithm instead of the baseline HMM model, and
that the proposed local KL divergence is an effective measure for guiding the
learning process. Regarding stemming, according to our experimental results, it
makes no significant difference to the overall performance. We conjecture that the effect
of stemming may become apparent when the training set size increases.
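
The F-measure values reported in Table 1 can be related to the precision and recall columns as sketched below. The sketch assumes the balanced F-measure (the harmonic mean of precision and recall), which the text does not state explicitly.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (%) as reported in Table 1.
print(round(f_measure(89.7, 55.3), 1))  # proposed system, with stemming
print(round(f_measure(87.6, 56.4), 1))  # proposed system, without stemming
print(round(f_measure(50.8, 25.0), 1))  # RAPIER
```

Under this assumption the proposed-system and RAPIER rows are reproduced; the two baseline rows come out about 0.1 higher than reported, presumably because the published precision and recall are already rounded.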

Table 1. Performance comparison in extracting the field “language” (# of training
= 10, # of validation = 10, # of testing = 80)

System                      Initial no.   Final no.   Stem?   Precision   Recall   F-Measure
                            of states     of states           (%)         (%)      (%)
Baseline HMM                530           4           Y       66.3        44.9     53.4
Baseline HMM                530           4           N       66.7        44.7     53.4
Proposed System (α=0.35)    530           21          Y       89.7        55.3     68.4
Proposed System (α=0.35)    530           14          N       87.6        56.4     68.6
RAPIER*                     -             -           -       50.8        25.0     33.5



5 Conclusions 

An HMM topology learning algorithm using a local KL divergence has been
proposed in this paper. Preliminary experimental results are encouraging for
applying the methodology to information extraction tasks. In particular, it
outperforms a simple HMM model and also the well-known RAPIER system. For
future work, more extensive experiments should be conducted for performance
comparison. Reducing the learning time is also crucial for
enabling the proposed algorithm to scale up. One important issue in the
proposed learning algorithm is how to set the threshold for accepting a candidate
merge. More research effort in this direction is needed.

Abstract. Many representation schemes for time series have been proposed
and most of them require predefined parameters. In
classification, the accuracy is considerably influenced by these predefined
parameters, and users usually have difficulty in determining them.
The aim of this paper is to develop a representation method for
time series that can automatically select the parameters for the classifica- 
tion task. To this end, we exploit the multi-scale property of wavelet de- 
composition that allows us to automatically extract features and achieve 
high classification accuracy. Two main contributions of this work are: 
(1) selecting features of a representation that helps to prevent time se- 
ries shifts, and (2) choosing appropriate features, namely, features in an 
appropriate wavelet decomposition scale according to the concentration 
of wavelet coefficients within this scale. 



1 Introduction 

Many algorithms have been proposed for mining time series data [14]. Among 
the different time series mining domains, time series classification is one of the 
most important domains. Time series classification has been successfully used in 
various application domains such as medical data analysis, sign language recog- 
nition, speech recognition, etc. For efficiency and effectiveness, most of the pro- 
posed methods classify time series on high level representations of time series in- 
stead of classifying the time series directly. These representations include Fourier 
Transforms [1], Piecewise Linear Representation (PLR) [15], Piecewise Aggre- 
gate Approximation [13,17], Regression Tree Representation [9], Haar Wavelets 
[4], Symbolic Representation [16], etc. 

Most of the proposed representation schemes in the literature require predefined
parameters. In classification, the accuracy is considerably influenced
by these predefined parameters. Selecting them is not trivial, and as a
consequence, users usually have difficulty in determining the parameters.

In this paper we introduce a time series representation based on Haar wavelet
decomposition. The problem of time series classification is tackled by
selecting features of the representation. We propose a novel feature extractor
that extracts the approximation and the change amplitudes of a time series. The
Euclidean distance between the features of a shifted time series and those of
the original time series is no larger than the distance between the raw data.
The appropriate features, i.e., features within the appropriate wavelet
decomposition scale, are chosen by the concentration of features. The
appropriate features are also used for noise reduction. The corresponding time
series classification algorithms are also presented.

The rest of this paper is organized as follows. Section 2 briefly discusses back- 
ground material on time series classification and Haar wavelet decomposition. 
Section 3 introduces our time series representation and classification algorithms. 
Section 4 contains a comparative experimental evaluation of the classification ac- 
curacy with the proposed approach and other approaches. Section 5 introduces 
the related work. Section 6 gives some conclusions and suggestions for future 
work. 

2 Background 

In order to frame our contribution in the proper context we begin with a re- 
view of the concept of time series classification and the process of Haar wavelet 
decomposition . 

2.1 Time Series Classification 

A time series is a sequence of chronologically ordered data. A time series can
be viewed as a series of values of a variable that is a function of time $t$,
that is, $x_t = f(t)$. For simplicity, this paper assumes that the observed
values are obtained at equal time intervals of length 1.

Given a training time series data set $D = \{d_1, d_2, \ldots, d_n\}$, where
each sample $d_i$ is a time series associated with a value of the class
attribute $c \in C$, the task of time series classification is to classify a
new testing time series $x$ by a model constructed on $D$.

2.2 Haar Wavelet Decomposition 

Wavelet transform is a domain transform technique for hierarchically decompos- 
ing sequences. It allows a sequence to be described in terms of an approximation 
of the original sequence, plus a set of details that range from coarse to fine. The 
property of wavelets is that the broad trend of the input sequence is preserved 
in approximation part, whereas the localized changes are kept in detail parts. 
No information will be gained or lost during the decomposition process. The 
original signal can be fully reconstructed from the approximation part and the 
detail parts. 

The Haar wavelet is the simplest and most popular wavelet, introduced by Haar.
Its benefit is that the decomposition process has low computational
complexity: given a time series of length $n$, where $n$ is an integer power of
2, the complexity of Haar decomposition is $O(n)$. The mathematical foundation
of Haar wavelets can be found in [3].

The length of the input time series is restricted to an integer power of 2 in
the process of wavelet decomposition. If the length of the input time series
does not satisfy this requirement, the series is extended to an integer power
of 2 by padding zeros at its end. Therefore, the number of decomposition scales
for the input time series is $\log_2(n)$, where $n$ is the length of the
zero-padded input series.

The structure of the decomposed hierarchical coefficients is shown in Table 1.

Table 1. The hierarchical wavelet coefficients

Scale          Coefficients
0              $A_0$
1              $A_1$, $D_1$
2              $A_2$, $D_2$, $D_1$
...            ...
$\log_2(n)$    $A_k$, $D_k$, ..., $D_2$, $D_1$





Scale 0 is the original time series. To calculate the approximation
coefficients $A_j = \{a_{0,j}, a_{1,j}, \ldots, a_{n-1,j}\}$ and the detail
coefficients $D_j = \{d_{0,j}, d_{1,j}, \ldots, d_{n-1,j}\}$ within scale $j$,
the approximation part of scale $j-1$ is divided into an even part containing
the even-indexed samples,
$even(A_{j-1}) = \{a_{0,j-1}, a_{2,j-1}, \ldots, a_{2n-2,j-1}\}$, and an odd
part containing the odd-indexed samples,
$odd(A_{j-1}) = \{a_{1,j-1}, a_{3,j-1}, \ldots, a_{2n-1,j-1}\}$. The
approximation coefficients $A_j$ are calculated as

$$A_j = \frac{1}{\sqrt{2}} \left( even(A_{j-1}) + odd(A_{j-1}) \right).$$

The detail coefficients within scale $j$ are calculated as

$$D_j = \frac{1}{\sqrt{2}} \left( even(A_{j-1}) - odd(A_{j-1}) \right).$$
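The decomposition step described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
names (`haar_step`, `haar_decompose`, `flatten`, `euclidean`) are hypothetical
and not from the paper, and the normalization factor 1/sqrt(2) is assumed
because it is the choice under which the Haar transform preserves Euclidean
distance.

```python
import math

def haar_step(approx):
    """One Haar level: combine even- and odd-indexed samples into (A_j, D_j)."""
    even, odd = approx[0::2], approx[1::2]
    s = 1.0 / math.sqrt(2.0)  # assumed orthonormal scaling
    a = [s * (e + o) for e, o in zip(even, odd)]
    d = [s * (e - o) for e, o in zip(even, odd)]
    return a, d

def haar_decompose(series, scale):
    """Zero-pad to a power of 2, then return (A_scale, [D_1, ..., D_scale])."""
    n = 1
    while n < len(series):
        n *= 2
    max_scale = n.bit_length() - 1  # log2(n) for a power of 2
    if not 1 <= scale <= max_scale:
        raise ValueError("scale must be between 1 and log2(n)")
    approx = list(series) + [0.0] * (n - len(series))
    details = []
    for _ in range(scale):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

def flatten(approx, details):
    """Concatenate coefficients as {A_k, D_k, ..., D_1}."""
    out = list(approx)
    for d in reversed(details):
        out.extend(d)
    return out

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two illustrative series of length 8 (made-up values).
x = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
y = [1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0]
wx = flatten(*haar_decompose(x, 3))
wy = flatten(*haar_decompose(y, 3))
```

With this scaling, `euclidean(x, y)` and `euclidean(wx, wy)` agree up to
floating-point error, which is a quick way to check an implementation.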

3 Appropriate Feature Extraction for Wavelet Based 
Classification 

We propose a wavelet-based classification scheme consisting of the following
steps:

1. Decompose the training data set and the testing data with wavelets;

2. Extract features from the wavelet coefficients;

3. Construct a model on the features of the training data set;

4. Classify the testing data by the model.

The performance of classification depends on the feature extractor and the
classification algorithm. We introduce a time series representation and the
corresponding feature extraction in Sections 3.1 and 3.2. We suggest two
different classification algorithms in Sections 3.3 and 3.4.
3.1 Time Series Representation and Feature Extraction

The concatenation of the decomposed wavelet coefficients of a time series
$X = \{x_0, x_1, \ldots, x_{n-1}\}$ to a particular scale
$k \in \{1, 2, \ldots, \log_2(n)\}$, shown in (1), is a representation of $X$.

$$W_k^X = \{A_k^X, D_k^X, D_{k-1}^X, \ldots, D_1^X\} \qquad (1)$$



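The Euclidean distance between two time series
$X = \{x_0, x_1, \ldots, x_{n-1}\}$ and $Y = \{y_0, y_1, \ldots, y_{n-1}\}$ is

$$Dist(X, Y) = \sqrt{\sum_{i=0}^{n-1} (x_i - y_i)^2},$$

and the Euclidean distance $Dist(W_k^X, W_k^Y)$ between the corresponding
wavelet coefficients of $X$ and $Y$ at a particular scale $k$ is

$$Dist(W_k^X, W_k^Y) = \sqrt{\sum_i (a_{i,k}^X - a_{i,k}^Y)^2 + \sum_{j=1}^{k} \sum_i (d_{i,j}^X - d_{i,j}^Y)^2}.$$

The Euclidean distance is preserved through the Haar wavelet transform, i.e.,
$Dist(X, Y) = Dist(W_k^X, W_k^Y)$ [4].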

Note that the detail coefficients $D_j^X = \{d_{0,j}, d_{1,j}, \ldots\}$
contain the local changes of the time series. Thus
$|D_j^X| = \{|d_{0,j}|, |d_{1,j}|, \ldots\}$ denotes the amplitudes of the
local changes. The concatenation of the decomposed wavelet approximation
coefficients $A_k^X$ and the absolute values of the decomposed wavelet detail
coefficients $|D_j^X|$, $j = 1, 2, \ldots, k$, to a particular scale
$k \in \{1, 2, \ldots, \log_2(n)\}$ of a time series $X$ is defined as the
features of $X$:

$$F_k^X = \{A_k^X, |D_k^X|, |D_{k-1}^X|, \ldots, |D_1^X|\} \qquad (2)$$



This definition helps to overcome the well-known problem that wavelet
coefficients are sensitive to shifts of the series.

The Euclidean distance $Dist(F_k^X, F_k^Y)$ between the features of two time
series $X$ and $Y$ is defined as

$$Dist(F_k^X, F_k^Y) = \sqrt{\sum_i (a_{i,k}^X - a_{i,k}^Y)^2 + \sum_{j=1}^{k} \sum_i (|d_{i,j}^X| - |d_{i,j}^Y|)^2}.$$

Because $\bigl| |x| - |y| \bigr| \le |x - y|$, we obtain
$Dist(F_k^X, F_k^Y) \le Dist(W_k^X, W_k^Y) = Dist(X, Y)$. This inequality
still holds if $X$ and $Y$ denote the original time series and the shifted
time series, respectively.
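The bound can be checked term by term. The short derivation below is a sketch
that uses only the reverse triangle inequality for real numbers $u$ and $v$
and the two distance definitions of this section; the symbols $u$, $v$ are
introduced here for illustration.

```latex
\begin{align*}
\bigl|\,|u| - |v|\,\bigr| \le |u - v|
  \quad &\Longrightarrow \quad (|u| - |v|)^2 \le (u - v)^2, \\
Dist(F_k^X, F_k^Y)^2
  &= \sum_i (a_{i,k}^X - a_{i,k}^Y)^2
     + \sum_{j=1}^{k} \sum_i \bigl(|d_{i,j}^X| - |d_{i,j}^Y|\bigr)^2 \\
  &\le \sum_i (a_{i,k}^X - a_{i,k}^Y)^2
     + \sum_{j=1}^{k} \sum_i \bigl(d_{i,j}^X - d_{i,j}^Y\bigr)^2
   = Dist(W_k^X, W_k^Y)^2 .
\end{align*}
```

The approximation terms are identical on both sides, so only the detail terms
can shrink when absolute values are taken.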



3.2 Appropriate Scale Selection 

Notice that, by the multi-scale property of wavelets, wavelet decomposition
offers multiple choices of features. It is natural to ask which scale yields
features that are better for classification than those of the other scales. If
the energy of a scale is concentrated in a few coefficients, then these few
important coefficients can represent all the coefficients with low error. Such
a scale may give valuable information for classification.
The Shannon entropy, which is a measure of impurity within a set of instances,
can describe the concentration of coefficients within a scale. The Shannon
entropy is defined in (3):

$$H = -\sum_i p_i \log_2 p_i \qquad (3)$$

The appropriate decomposition scale is defined as the scale with the lowest
entropy. The appropriate features of a time series are defined as the wavelet
coefficients within the appropriate decomposition scale.

$$\text{Appropriate scale} = \arg\min_k \left( -\sum_i p_{i,k} \log_2 p_{i,k} \right) \qquad (4)$$

Here $p_{i,k} = |F_{i,k}| / \sum_j |F_{j,k}|$ is the proportion between the
absolute value of a coefficient in a feature and the sum of the absolute values
of the whole feature, where $F_{i,k}$ denotes the $i$-th entry of the features
at scale $k$. $p_{i,k}$ has the properties $\sum_i p_{i,k} = 1$ and
$p_{i,k} \ge 0$.
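The scale selection rule can be illustrated with a small Python sketch. It
assumes a series whose length is already a power of 2 and an orthonormal
(1/sqrt(2)) Haar step; the function names `haar_features`, `entropy` and
`appropriate_scale` are hypothetical, and ties between scales are broken in
favor of the smaller scale, which the text does not specify.

```python
import math

def haar_features(series, scale):
    """Features {A_k, |D_k|, ..., |D_1|}; len(series) must be a power of 2."""
    approx, details = list(series), []
    s = 1.0 / math.sqrt(2.0)  # assumed orthonormal scaling
    for _ in range(scale):
        even, odd = approx[0::2], approx[1::2]
        details.append([abs(s * (e - o)) for e, o in zip(even, odd)])
        approx = [s * (e + o) for e, o in zip(even, odd)]
    feats = list(approx)
    for d in reversed(details):
        feats.extend(d)
    return feats

def entropy(features):
    """Shannon entropy of the proportions p_i = |f_i| / sum_j |f_j|."""
    total = sum(abs(f) for f in features)
    if total == 0.0:
        return 0.0
    h = 0.0
    for f in features:
        p = abs(f) / total
        if p > 0.0:  # 0 * log(0) is treated as 0
            h -= p * math.log2(p)
    return h

def appropriate_scale(series):
    """Scale in 1..log2(n) whose features have the lowest entropy."""
    max_scale = len(series).bit_length() - 1
    return min(range(1, max_scale + 1),
               key=lambda k: entropy(haar_features(series, k)))
```

For a constant series, all energy ends up in a single approximation
coefficient at the deepest scale, so that scale has entropy 0 and is selected.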



3.3 Classification Algorithm 

For two series of the same length, their appropriate scales may not be equal.
We cannot compare the similarity of two sets of appropriate features directly
because the meaning of each data entry differs. For example, consider a time
series $X = \{x_0, x_1, x_2, x_3\}$ with appropriate scale 1 and a time series
$Y = \{y_0, y_1, y_2, y_3\}$ with appropriate scale 2. The features of series
$X$ are $F_1(X) = \{a_{0,1}, a_{1,1}, |d_{0,1}|, |d_{1,1}|\}$ and the features
of series $Y$ are $F_2(Y) = \{a_{0,2}, |d_{0,2}|, |d_{0,1}|, |d_{1,1}|\}$.
Comparing detail coefficients with approximation coefficients would introduce
errors and be meaningless.

To avoid this problem, we merge the distances of features within different
appropriate scales. The classification algorithm is implemented by means of the
1-nearest neighbor algorithm, and we call it WCANN (Wavelet Classification
Algorithm based on 1-Nearest Neighbor).

Table 2 shows an outline of the classification algorithm. The input is $S$ (the
training time series set) and $x$ (a new testing time series). The output is
$x_c$, the label of $x$.

$x$ is labeled as a member of a class if and only if the distance between $x$
and one instance of that class is smaller than the distance between $x$ and
every other instance.

The distance between two sets of features is replaced by the average of the
distances computed at the two appropriate scales. The distance between the
features of time series $X$ and the features of time series $Y$ is

$$Dist(F^X, F^Y) = \frac{Dist(F_m^X, F_m^Y) + Dist(F_n^X, F_n^Y)}{2},$$

where $m$ is the appropriate scale of $X$ and $n$ is the appropriate scale of
$Y$.
Table 2. The WCANN algorithm (S, x)

For each training example $s_i$, calculate its appropriate scale $m_i$ and the
corresponding appropriate features $F_{m_i}^{s_i}$;
Given a testing instance $x$, calculate its appropriate scale $n$ and
appropriate features $F_n^x$;
best-so-far = inf;
for i = 1 to length(S) do
    $Dist(F_{m_i}^{s_i}, F_{m_i}^x)$ = distance between the features of $s_i$ and $x$ on scale $m_i$;
    $Dist(F_n^{s_i}, F_n^x)$ = distance between the features of $s_i$ and $x$ on scale $n$;
    $Dist(F^{s_i}, F^x) = (Dist(F_{m_i}^{s_i}, F_{m_i}^x) + Dist(F_n^{s_i}, F_n^x)) / 2$;
    if $Dist(F^{s_i}, F^x)$ < best-so-far then
        pointer-to-best-series k = i;
        best-so-far = $Dist(F^{s_i}, F^x)$;
    end if
end for
return $x_c$ = the label of $s_k$;
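The outline in Table 2 can be sketched in Python as follows. This is a
minimal illustration, not the authors' implementation: all series are assumed
to have the same power-of-2 length, the Haar step uses an assumed 1/sqrt(2)
scaling, and the function names (`haar_features`, `entropy`,
`appropriate_scale`, `dist`, `wcann_distance`, `wcann_classify`) are
hypothetical.

```python
import math

def haar_features(series, scale):
    """Features {A_k, |D_k|, ..., |D_1|}; len(series) must be a power of 2."""
    approx, details = list(series), []
    s = 1.0 / math.sqrt(2.0)  # assumed orthonormal scaling
    for _ in range(scale):
        even, odd = approx[0::2], approx[1::2]
        details.append([abs(s * (e - o)) for e, o in zip(even, odd)])
        approx = [s * (e + o) for e, o in zip(even, odd)]
    feats = list(approx)
    for d in reversed(details):
        feats.extend(d)
    return feats

def entropy(features):
    total = sum(abs(f) for f in features)
    if total == 0.0:
        return 0.0
    probs = [abs(f) / total for f in features]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def appropriate_scale(series):
    max_scale = len(series).bit_length() - 1
    return min(range(1, max_scale + 1),
               key=lambda k: entropy(haar_features(series, k)))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def wcann_distance(x, y):
    """Average of the feature distances at the two appropriate scales."""
    m, n = appropriate_scale(x), appropriate_scale(y)
    d_m = dist(haar_features(x, m), haar_features(y, m))
    d_n = dist(haar_features(x, n), haar_features(y, n))
    return (d_m + d_n) / 2.0

def wcann_classify(train, x):
    """train is a list of (series, label) pairs; returns the nearest label."""
    best_label, best_so_far = None, math.inf
    for series, label in train:
        d = wcann_distance(series, x)
        if d < best_so_far:
            best_so_far, best_label = d, label
    return best_label

# Toy training set with made-up series.
train = [([0.0] * 4 + [1.0] * 4, "step"), ([1.0] * 8, "flat")]
label = wcann_classify(train, [1.0] * 7 + [1.1])
```

A practical implementation would cache the appropriate scale and features of
each training series instead of recomputing them for every query.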



3.4 Noise Reduction on Features 



The idea of wavelet noise shrinkage is based on the assumption that the
amplitude spectrum of the signal is as different as possible from that of the
noise. If a signal has its energy concentrated in a small number of wavelet
coefficients, these coefficients will be relatively large compared to the
noise, which has its energy spread over a large number of coefficients. This
allows thresholding of the amplitudes of the coefficients to separate signal
from noise. The thresholding is based on a value $\tau$ against which all the
detail coefficients are compared. The appropriate scale defined in Section 3.2
helps to reduce noise because, in that scale, the energy of the signal is
concentrated in a few coefficients while the noise remains spread out. Hence it
is convenient to separate the signal from the noise by keeping the large
coefficients (which represent signal) and deleting the small ones (which
represent noise).

Donoho and Johnstone [8] gave the threshold $\tau = \sigma_n \sqrt{2 \log N}$,
where $\sigma_n$ is the standard deviation of the noise and $N$ is the length
of the time series. Because we do not know $\sigma_n$ of the time series in
advance, we estimate it by the robust median estimation of noise introduced in
[8]. The robust median estimate is the median absolute deviation of the detail
wavelet coefficients at scale one, divided by 0.6745.

The usual hard thresholding algorithm is used in this paper: the detail
coefficients whose absolute values are lower than the threshold are set to
zero [7]. The hard thresholding algorithm for the features defined in (4) is
given in (5).

$$Thre(|D_{i,j}|) = \begin{cases} |D_{i,j}|, & |D_{i,j}| \ge \tau \\ 0, & |D_{i,j}| < \tau \end{cases} \qquad (5)$$
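The threshold estimate and the hard thresholding rule can be sketched as
follows. The sketch assumes the natural logarithm in the threshold formula and
computes the median absolute deviation around the median of the scale-one
detail coefficients; both are interpretations, and the names
`universal_threshold` and `hard_threshold` are hypothetical.

```python
import math
import statistics

def universal_threshold(scale_one_details, n):
    """tau = sigma * sqrt(2 * ln(n)), with sigma estimated as MAD / 0.6745."""
    med = statistics.median(scale_one_details)
    mad = statistics.median([abs(d - med) for d in scale_one_details])
    sigma = mad / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))  # natural log is an assumption

def hard_threshold(abs_details, tau):
    """Keep amplitudes that reach tau and set the others to zero."""
    return [d if d >= tau else 0.0 for d in abs_details]

# Made-up scale-one detail coefficients of a series of length 8.
tau = universal_threshold([0.1, -0.1, 0.1, -0.1], 8)
cleaned = hard_threshold([0.05, 2.0, 0.3], 0.3)
```

Only the detail amplitudes are thresholded; the approximation coefficients of
the features are left unchanged.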



The classification algorithm for noise-reduced coefficients, WCANR (Wavelet
Classification Algorithm with Noise Reduction), is similar to the WCANN
algorithm. The only difference is that the noise within the appropriate
features is reduced before classification.

Table 3. The error rates for various feature extraction approaches

Approach                     CBF       CC
WCANN                        0.0026    0.0067
WCANR                        0.0026    0.0067
WC-NN                        0.0026    0.0133
1-NN                         0.0026    0.0133
Euclidean Distance           0.003     0.013

Table 4. The error rates for various similarity measures

Approach                     CBF       CC
Euclidean Distance           0.003     0.013
Aligned Subsequence          0.451     0.623
Piecewise Normalization      0.130     0.321
Autocorrelation Functions    0.380     0.116
Cepstrum                     0.570     0.458
String (Suffix Tree)         0.206     0.578
Important Points             0.387     0.478
Edit Distance                0.603     0.622
String Signature             0.444     0.695
Cosine Wavelets              0.130     0.371
Holder                       0.331     0.593
Piecewise Probabilistic      0.202     0.321

4 Experimental Results 

For a fair comparison of methods, we followed the benchmarks introduced in
[14]. We also used the 1-nearest neighbor algorithm, evaluated by
leave-one-out, and experimented on the same data sets as those used in [14]:

- Cylinder-Bell-Funnel (CBF): This data set has been used for classification by
  Kadous [12], Geurts [9], Lin [16] and Keogh [14]. The aim is to discriminate
  three types of time series: cylinder (c), bell (b) and funnel (f). We
  generated 128 examples for each class with length 128 and time step 1, as in
  [14].

- Control Chart Time Series (CC): This data set has 100 instances for each of
  the six different classes of control charts. It has been used to validate
  clustering [2] and classification [9,16,14]. The data set we used was
  downloaded from the UCI KDD Archive [10].

We evaluated four algorithms using different features extracted from the two
data sets described above. 1-NN is the 1-nearest neighbor algorithm that uses
the raw data [14]. WC-NN is the 1-nearest neighbor algorithm that uses the
decomposed wavelet coefficients within the highest scale [5]. WCANN is our
proposed algorithm described in Section 3.3. WCANR is our proposed algorithm
described in Section 3.4.

An unclassified instance is assigned to the same class as its closest match in
the training set. The classification error rates of the different feature
extraction algorithms on the above two data sets are presented in Table 3. We
used only the Euclidean distance as the distance measure since it outperforms
the other distance measures in classification on these two data sets [14]. The
error rates of various similarity measures on the raw data, as given by [14],
are shown in Table 4. The 1-NN algorithm is in fact the same as the Euclidean
distance algorithm in Table 3; the Euclidean distance row in Table 3 is taken
from Table 4.

On both data sets, the WCANN algorithm performs the same as the WCANR
algorithm, and the 1-NN algorithm performs the same as the WC-NN algorithm. The
WCANN and WCANR algorithms achieve the same accuracy as the 1-NN and WC-NN
algorithms on the CBF data set and higher accuracy on the CC data set (error
rate 0.0067 versus 0.0133).

5 Related Work 

A large number of techniques have been proposed for efficient time series
querying based on wavelets. However, none of the proposed work is concerned
with selecting an appropriate scale. Our work is similar in spirit to the
wavelet packet algorithm introduced by Coifman et al. [6], who used entropy to
select the best basis of wavelet packets. Our solution differs from the
solution in [6] in two respects: we use wavelets rather than wavelet packets to
decompose the data, and we calculate the entropy from the defined features
rather than from the energy of the wavelet coefficients.

6 Conclusions and Future Work 

We defined features on Haar wavelet coefficients that help to overcome their
sensitivity to shifts of the time series. We selected the appropriate scale of
the Haar wavelet decomposition for classification. We proposed a nearest
neighbor classification algorithm using the derived appropriate scale.

We conducted experiments on two widely used test time series data sets and
compared the accuracy of our algorithms with that of two other algorithms based
on Euclidean distance. Our algorithms outperform the other two algorithms on
one data set and achieve the same results on the other.

We intend to extend this work in two directions. First, we will apply this
approach to a real time series data set. We are in the process of working on a
hepatitis data set [11]. This hepatitis data set contains irregular time series
data measured on nearly one thousand examinations. The targets of mining this
data set include time series classification. Second, we will combine this work
with distance-based clustering algorithms. In fact, the main idea of this work
is selecting appropriate features of a time series for better similarity
comparison. Clustering on these features instead of on the time series data
directly may give better results.
Abstract. Cross-sectional time series data are a group of multivariate time
series, each of which has the same set of variables. Usually their length is
short. Such data occur frequently in business, economics, science, and so on.
We want to mine rules from them, such as "GDP rises if Investment rises in
most provinces" in economic analysis. Rule mining is divided into two steps:
event distilling and association rule mining. This paper concentrates on the
former and applies Apriori to the latter. The paper defines event types based
on relative differences. Considering the cross-sectional property, we introduce
an ANOVA-based event-distilling method that can obtain proper events from
cross-sectional time series. Finally, experiments on synthetic and real-life
data show the advantage of the ANOVA-based event-distilling method over the
separate event-distilling method, as well as the factors that influence it.



1 Introduction 

Cross-sectional time series data are a group of multivariate time series, each
of which has the same set of variables. Actual cross-sectional time series are
usually short. For example, a macroeconomic yearly time series is composed of
the time series of all provinces (Figure 1) over a few years; each province has
yearly data on GDP, Investment and so on. More examples occur frequently in
large-scale companies, organizations, and governments, including quarterly
marketing data from all branches of a country-wide company over several years,
and monthly taxation data from every county substation of a provincial revenue
department over twelve months. The same attributes in all sections have the
same meaning and behave similarly, which is called the cross-sectional
property.

Thus, the question is how to mine association rules from cross-sectional time
series. We decompose the question into two steps: event distillation and
association rule mining. In the first step, events are distilled from the
original time series with respect to the cross-sectional property. This step is
the emphasis of the paper. In the second step, interesting rules are mined from
the distilled events. We adopt the existing Apriori algorithm in this step.

Fig. 1. An Example from Cross-Province Time Series of Chinese Macroeconomic
(panels: Province 1, Province 2, Province 3)

* This work is supported by both the National High Technology Research and
Development Program of China (863 Program) under Grant No. 2002AA444120 and the
National Key Basic Research and Development Program of China (973 Program)
under Grant No. 2002CB312006.

This paper employs an event definition based on relative differences. This
definition is suitable for short time series and is easy to explain. In
addition, it is convenient to distill events with this relatively simple
definition.

Let us consider the cross-sectional property intuitively via Figure 1. Each
attribute behaves similarly across provinces. Considering relative differences,
also called increasing ratios, some sections may have similar ratios on some
attribute, e.g. Province 1 and Province 2 on GDP; but the ratios may not be
similar across all sections, e.g. the faster increase of Province 3 on GDP.

Therefore, we introduce an ANOVA-based algorithm that performs event
distillation by considering the cross-sectional property. More useful
information about an attribute is distilled if we consider the same attribute
across sections holistically rather than separately. For comparison, the paper
first gives the separately distilling method and then the ANOVA-based method.

The ANOVA-based event-distilling algorithm works as follows. First, the
relative differences of all attributes in all sections are computed. Second, we
use ANOVA to group sections in a greedy manner. Finally, for each group of
sections, we estimate the distribution parameters of the relative differences
and distill the events via discretization on the probabilistic distribution.

The remainder of the paper is organized as follows. Section 2 defines this 
problem exactly. Section 3 describes how to pick up proper events from original 
cross-sectional data. In Section 4, experiments on synthetic and real-life data 
show the advantage of ANOVA-based algorithm. Section 5 gives the conclusion. 



1.1 Related Work 

Statistics has introduced some cross-sectional time series analysis methods,
such as the Dummy Variable Model and the Error Components Model [5]. These
approaches analyze inter-attribute or intra-attribute relationships via global
regression models. Unlike them, this paper does not seek a global model and
analyzes relationships in a relatively non-parametric manner. The difference is
the same as that between statistical and data mining methods of generic time
series analysis.

There are some data mining approaches to rule mining from time series, such as
[3,6]. These papers mine rules from time series in two steps, event
distillation and rule mining, which our method follows. Event distillation is
also called time series discretization. Each paper, however, uses different
methods for the two steps, according to its own problem.

The event definition in this paper is based on relative differences, as is that
of [1]. [3] has a different event definition that takes special shapes with
fixed length as event types. With that definition, however, we might get too
few events from short time series. For example, from one time series of length
15 we get 6 events with shapes of length 10 (15 - 10 + 1 = 6 window positions),
but 14 events based on first-order differences.

2 Notions 

An event type is a specified discrete data type, which usually describes a
certain characteristic of a time series. For instance, a set $\mathcal{E}$ of
event types may be {up, down, stable, UP, DOWN, STABLE}, as Table 1 shows.

Given a cross-sectional time series, let $T$ be the time domain, $S$ be the set
of sections, and $\mathcal{A}$ be the set of attributes. $T$ is a finite set,
since time is discrete. For example, in a cross-province macroeconomic time
series, $T$ is {1978, 1979, ..., 2003}, $S$ is {Anhui, Beijing, ..., Zhejiang},
and $\mathcal{A}$ is {GDP, Investment, Payout, ...}.

Suppose an event type set $\mathcal{E}$, an attribute set $\mathcal{A}$, a time
set $T$, and a section set $S$ are given. Any event belongs to
$\mathcal{A} \times \mathcal{E}$, such as "GDP up" or "Investment stable". All
events from one section at a certain time form an event set
$E = \{e_1, e_2, \ldots, e_n\}$, where $e_i \in \mathcal{A} \times \mathcal{E}$.
Thus the number of event sets is $|S| \times |T|$.

Association Rule Mining. An association rule, such as "if GDP increases, then
Investment increases", can be formulated as an implication $A \Rightarrow B$,
where $A, B \subset \mathcal{A} \times \mathcal{E}$ are sets of events and
$A \cap B = \emptyset$. The first illustration of mining association rules is
shopping basket analysis, for which [2,7] give an efficient algorithm called
Apriori. There are two measures: support and confidence. Given an association
rule $A \Rightarrow B$, the support is $P(A \cup B)$ and the confidence is
$P(A \cup B)/P(A)$. In addition, the support of sections is introduced to
measure how many sections the rule covers. That is, support of sections =
Number of Rule-Covered Sections / Number of Total Sections.
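The three measures can be illustrated with a short Python sketch. The data are
made up, the function names are hypothetical, and one point is an assumption:
a section is counted as "rule-covered" if at least one of its event sets
contains both sides of the rule, which the text does not define precisely.

```python
def support(event_sets, items):
    """Fraction of event sets that contain every event in `items`."""
    items = set(items)
    return sum(1 for e in event_sets if items <= e) / len(event_sets)

def confidence(event_sets, antecedent, consequent):
    """P(A u B) / P(A); returns 0.0 when the antecedent never occurs."""
    sup_a = support(event_sets, antecedent)
    if sup_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(event_sets, both) / sup_a

def support_of_sections(sets_by_section, antecedent, consequent):
    """Share of sections with at least one event set matching the rule (assumed)."""
    both = set(antecedent) | set(consequent)
    covered = sum(1 for sets in sets_by_section.values()
                  if any(both <= e for e in sets))
    return covered / len(sets_by_section)

# Made-up event sets: one set per (section, time) pair.
sets_by_section = {
    "P1": [{("GDP", "up"), ("Investment", "up")},
           {("GDP", "stable"), ("Investment", "up")}],
    "P2": [{("GDP", "up"), ("Investment", "up")},
           {("GDP", "down"), ("Investment", "down")}],
    "P3": [{("GDP", "stable"), ("Investment", "stable")},
           {("GDP", "down"), ("Investment", "stable")}],
}
all_sets = [e for sets in sets_by_section.values() for e in sets]
rule_a = {("Investment", "up")}   # antecedent A
rule_b = {("GDP", "up")}          # consequent B
```

Here the rule "Investment up implies GDP up" holds in two of the six event
sets, and its antecedent occurs in three of them.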

Analysis of Variance, or ANOVA for short, is a statistical method to decide
whether more than two samples can be considered representative of the same
population [4]. The paper adopts the basic one-way ANOVA among its many
variations. ANOVA returns a p-value, which lies between 0 and 1. The critical
p-value is usually 0.05 or 0.01. That is, if the p-value is less than the
critical value, the data are not likely to be sampled from the same population.

Independently and Identically Distributed, or i.i.d. for short. That more than
two samples are i.i.d. means that they are sampled from the same population and
are not correlated with one another.



Table 1. Event Types

Symbol   Description                                              Condition
up       increasing transition is high enough                     $d_t^{(1)} > \mu_1 + \sigma_1 T$
down     decreasing transition is high enough                     $d_t^{(1)} < \mu_1 - \sigma_1 T$
stable   increasing or decreasing is slight                       others
UP       increasing is accelerated or decreasing is decelerated   $d_t^{(2)} > \mu_2 + \sigma_2 T$
DOWN     increasing is decelerated or decreasing is accelerated   $d_t^{(2)} < \mu_2 - \sigma_2 T$
STABLE   increasing or decreasing changes slightly                others

Comment: $d^{(n)}$ means the $n$-order relative difference of $x$ ($n \ge 1$),
that is, $d_t^{(1)} = (x_t - x_{t-1})/x_{t-1}$ and
$d_t^{(n)} = d_t^{(n-1)} - d_{t-1}^{(n-1)}$. $\mu_1$ and $\sigma_1$ are the
mean and standard deviation of the population generating the relative
differences of some attribute. Similarly, $\mu_2$ and $\sigma_2$ are for the
2-order relative differences. $T$ is the significance threshold.



3 Events Distillation 

Table 1 gives two groups of event types. The first three are derived from
1-order relative differences, and the last three are derived from 2-order ones.
The event types have intuitive meanings, as the descriptions show.

Here the relative differences are assumed to follow a Gaussian distribution.
For generic time series, we can assume that the relative differences in one
section are i.i.d., which is the assumption of the separately event-distilling
method. On the other hand, considering the cross-sectional property, we can
assume that the relative differences in some sections are i.i.d., which is the
assumption of the ANOVA-based event-distilling method. The following
subsections describe these two methods and discuss their assumptions. The
ANOVA-based method is the emphasis of the paper, and the separate one is mainly
for comparison.

3.1 Separately Distilling 

[1] tried to distill events by relative differences, but its events are
discriminated by a uniform threshold that is set manually. Manual setup of the
threshold is not suitable for multivariate time series. It is more reasonable
to set the relative difference threshold of each attribute separately. That is,
the threshold of each attribute should be computed automatically according to
the Gaussian distribution of its relative differences and the significance
threshold T.

This paper uses relative differences rather than differences because of
cross-sectional diversity: the magnitude of change of an attribute in a section
depends strongly on its base.
Algorithm. Let x be the time series data of one attribute in one section.
Listing 1.1 gives the algorithm to distill events via (1-order) relative
differences. Distilling events of 2-order relative differences is similar. E is
the array of event sets.

Listing 1.1. Separately Distilling Algorithm

    Set T;
    Compute d[2...Length];
    mu := mean(d);
    sigma := std_deviation(d);
    for l := 2 to Length do begin
        if d[l] > mu + sigma*T then
            E[l].addEvent(up);
        else if d[l] < mu - sigma*T then
            E[l].addEvent(down);
        else
            E[l].addEvent(stable);
    end;
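The separately distilling algorithm can be sketched in Python as follows. The
sketch assumes the thresholds mu + sigma*T for "up" and mu - sigma*T for
"down", and it uses the sample standard deviation, which the listing does not
specify; the function names are hypothetical.

```python
import statistics

def relative_differences(x):
    """1-order relative differences (x[t] - x[t-1]) / x[t-1]."""
    return [(x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]

def distill_events(x, T):
    """Label each relative difference as 'up', 'down' or 'stable'."""
    d = relative_differences(x)
    mu = statistics.mean(d)
    sigma = statistics.stdev(d)  # sample standard deviation (assumption)
    events = []
    for value in d:
        if value > mu + sigma * T:
            events.append("up")
        elif value < mu - sigma * T:
            events.append("down")
        else:
            events.append("stable")
    return events

# Made-up yearly values of one attribute in one section.
gdp = [100.0, 110.0, 121.0, 133.1, 200.0, 220.0]
events = distill_events(gdp, 1.0)
```

Note that the events are relative to the section's own mean growth: a steady
10% increase is labeled "stable" here, and only the jump from 133.1 to 200.0
is labeled "up". The series needs at least three values so that the standard
deviation is defined.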





3.2 ANOVA-Based Distilling 

The separately distilling algorithm performs badly when the time series is
short, because the Gaussian parameters are difficult to estimate accurately
from a small sample. Section 4 shows this issue using synthetic data.

However, many cross-sectional time series are short. Fortunately, the
cross-sectional property is helpful. That is, we can assume that the relative
differences of one attribute in some sections are i.i.d., and then use these
sections jointly to estimate the distribution of the relative differences.

But how do we determine which sections have i.i.d. relative differences? The
statistical method ANOVA can help. Section 2 gives a brief introduction to
ANOVA. For example, we can choose the GDP attribute of any sections and compute
the p-value by ANOVA on the grouped relative differences. If the p-value is
large enough (e.g. > 0.05), these relative differences are treated as i.i.d.

The assumption is also consistent with real-life data. Figure 2 illustrates this 
assumption intuitively by one attribute from yearly macroeconomic time series 
of China. The p-value of the ANOVA test for all provinces is around $5 \times 10^{-8}$, which
means significantly not i.i.d. If we take Province 3, 5, 7, 23, and 24 as one group, 
other provinces as the other group, the two groups both have greater-than-0.05 
p-value. In this way proper events can be distilled. 

Given ANOVA, we can use a full-search method to group sections: test all
possible combinations of sections and choose the best one. Obviously, this
suffers from combinatorial explosion.

Fortunately, ANOVA is concerned with means, so a greedy algorithm can be
applied that considers the sections in order of the means of their relative
differences. In the first step, we choose the section with the smallest mean of
relative differences. In the second step, we add the section with the smallest
mean among the remaining sections, and then run ANOVA on the relative
differences of the chosen sections. If they are i.i.d. according to the test,
we repeat the second step; otherwise we take these sections, without the newly
added one, as a group and go back to the first step. When all sections have
been grouped, we estimate the Gaussian parameters and distill the events of
relative differences group by group.

Fig. 2. Box plots of the differences of some attribute in Chinese macroeconomic
time series (horizontal axis: Province Number).

Algorithm. Let x be the time series data of one attribute in all sections.
Listing 1.2 gives the ANOVA-based algorithm to distill events via (1-order)
relative differences. Distilling events of 2-order relative differences is
similar. E is the array of event sets. The function GenerateEvents is similar
to Listing 1.1 and is therefore omitted.

Listing 1.2. ANOVA-based Distilling Algorithm

    Set T and Pmin;
    Compute d[1...NumArea][2...Length];

    /* Sort Sections by Mean */
    for a := 1 to NumArea do begin
        mu[a] := mean(d[a]);
    end;
    idx := getSortedIndex(mu);

    /* A Greedy Algorithm with ANOVA */
    i1 := 1; i2 := 1;
    while (i2 < NumArea) do begin
        pvalue := ANOVA(d[idx[i1...i2+1]]);
        if pvalue > Pmin then
            i2 := i2 + 1;
        else begin
            GenerateEvents(i1, i2);
            i1 := i2 + 1; i2 := i1;
        end;
    end;
    /* Emit the last group, which may contain a single section */
    if i1 <= i2 then GenerateEvents(i1, i2);
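The greedy grouping loop can be sketched in Python as follows. To keep the
example free of third-party dependencies, the ANOVA test is passed in as a
function; `mean_gap_p_value` below is only a toy stand-in and is not a real
ANOVA. In practice one could pass, for example,
`lambda samples: scipy.stats.f_oneway(*samples).pvalue`. The names
`greedy_groups` and `mean_gap_p_value` are hypothetical.

```python
from statistics import mean

def greedy_groups(diffs_by_section, p_value_fn, p_min=0.05):
    """Group sections, visited in order of mean relative difference.

    diffs_by_section maps a section name to its list of relative differences.
    p_value_fn takes a list of samples and returns a p-value.
    """
    order = sorted(diffs_by_section, key=lambda s: mean(diffs_by_section[s]))
    if not order:
        return []
    groups, current = [], [order[0]]
    for section in order[1:]:
        candidate = current + [section]
        p = p_value_fn([diffs_by_section[s] for s in candidate])
        if p > p_min:
            current = candidate        # still compatible: extend the group
        else:
            groups.append(current)     # close the group without the new section
            current = [section]
    groups.append(current)             # the last group may be a single section
    return groups

def mean_gap_p_value(samples):
    """Toy stand-in for ANOVA: 'accept' if the sample means are within 0.05."""
    means = [mean(s) for s in samples]
    return 1.0 if max(means) - min(means) < 0.05 else 0.0

# Made-up relative differences for four sections.
diffs = {
    "P1": [0.10, 0.12],
    "P2": [0.11, 0.13],
    "P3": [0.30, 0.32],
    "P4": [0.31, 0.29],
}
groups = greedy_groups(diffs, mean_gap_p_value)
```

Each resulting group would then be pooled to estimate the mean and standard
deviation used for event generation.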
3.3 Remarks on Advantage of ANOVA-Based Method

The previous subsection stated that the ANOVA-based distilling algorithm
behaves better than the separate one on cross-sectional time series. But how
large is the advantage of the ANOVA-based method, and which factors affect it?

Regardless of the characteristics of concrete applications, there are three
factors in cross-sectional time series: the number of sections, the number of
attributes, and the length of the time series. The ANOVA-based method uses the
cross-sectional property to solve the small-sample problem caused by short
length. Hence the cross-sectional property is not helpful when there are too
few sections. On the other hand, the small-sample problem may disappear when
the time series is long enough. Thus the advantage of the ANOVA-based
distilling algorithm over the separate one will be significant if the length is
short enough and the number of sections is large enough. This is not a rigorous
condition, however, as the synthetic experiments in Section 4 show.

4 Experimental Results 

Both synthetic and real-life data are used in the experiments. The experiments
on synthetic data concern the performance of the event-distilling algorithms
and the factors that affect it. The experiments on real-life data present a
practical application in detail.



4.1 Synthetic Data 

It is known that mining association rules in time series has two steps: event
distillation and rule mining. The latter step is well studied, since we adopt
the basic Apriori algorithm. We therefore restrict the performance analysis to
the event distillation step, which makes the analysis simpler and clearer.

We are interested not only in the absolute performance of the ANOVA-based
method, but also in its advantage relative to the separate method. As discussed
in Section 3.3, the number of sections and the length of the time series affect
the advantage of the ANOVA-based method. We therefore run controlled
experiments on each of these two factors. The absolute performance is also
measured in these controlled experiments.

The first experiment concentrates on the effect of the number of sections.
The synthetic cross-sectional time series data includes 5 attributes and has 10
time periods, which is long enough, as the second experiment confirms. The
number of sections is the variable. We generate the start value and the
relative differences of each attribute to form the time series. The relative
differences of each attribute in all sections are generated from the same
Gaussian distribution. Unlike the original assumption, the relative differences
of all sections share the same distribution, because we want to find how much
the number of sections affects distilling performance. In actual applications,
after the sections are grouped by ANOVA, it is the number of sections in each
group that affects distilling performance. To make the simulation realistic, we
add some rules to these time series: attributes 2 and 3 increase and decrease
depending on attribute 1; attribute 5 increases and decreases depending on
attribute 4.

Fig. 3. Comparison by number of sections (axis label: Number of Sections).

Fig. 4. Comparison by length of time series (axis label: Length of Time
Series).
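The data generation for one attribute can be sketched as follows. This is only an illustration under stated assumptions: the relative difference is taken to be (x[t] - x[t-1]) / x[t-1], the start-value range and the Gaussian parameters are hypothetical, and the function names are not from the paper. The dependency rules between attributes are not modelled here.

```python
import random
from typing import List


def make_series(n_sections: int, length: int, mu: float, sigma: float,
                seed: int = 0) -> List[List[float]]:
    """One attribute: a start value plus Gaussian relative differences per section."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_sections):
        x = [rng.uniform(50.0, 150.0)]       # hypothetical start-value range
        for _ in range(length - 1):
            delta = rng.gauss(mu, sigma)     # relative difference of this period
            x.append(x[-1] * (1.0 + delta))
        data.append(x)
    return data


def relative_differences(x: List[float]) -> List[float]:
    """Assumed definition: (x[t] - x[t-1]) / x[t-1] for t = 1 .. len(x) - 1."""
    return [(x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]


# 20 sections, 10 time periods, all sections sharing one distribution.
series = make_series(n_sections=20, length=10, mu=0.05, sigma=0.02)
diffs = [relative_differences(x) for x in series]
print(len(series), len(series[0]), len(diffs[0]))
```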

The performance of the algorithms is measured by precision and recall. The
right events are known because we know the real distributions. We denote the
set of right events by R and the set of distilled events by D. Precision and
recall are defined as

$$\mathrm{Precision} = \frac{|R \cap D|}{|D|}, \qquad \mathrm{Recall} = \frac{|R \cap D|}{|R|},$$

where |.| denotes the cardinality of a set.
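As a small worked example of these two measures, the sketch below computes them on Python sets. The event encoding as (section, time, type) tuples and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
from typing import Set, Tuple

Event = Tuple[int, int, str]  # hypothetical encoding: (section, time, event type)


def precision_recall(right: Set[Event], distilled: Set[Event]) -> Tuple[float, float]:
    """Precision = |R n D| / |D| and recall = |R n D| / |R|."""
    hits = len(right & distilled)
    precision = hits / len(distilled) if distilled else 0.0
    recall = hits / len(right) if right else 0.0
    return precision, recall


R = {(0, 1, "up"), (0, 2, "down"), (1, 1, "up"), (1, 3, "up")}
D = {(0, 1, "up"), (1, 1, "up"), (1, 2, "down")}
precision, recall = precision_recall(R, D)
print(precision, recall)
```

Here two of the three distilled events are right, and two of the four right events are found.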

The second experiment concentrates on the effect of the length of the time
series. The synthetic cross-sectional time series data includes 5 attributes and
has 20 sections, which is large enough, as the first experiment confirms. The
length of the time series is the variable. The data generation in this
experiment is similar to that of the first experiment. To make the simulation
realistic, the relative differences of an attribute in the first 10 sections
share one distribution, and those in the last 10 sections share another
distribution. The rules are the same as those in the first experiment.

The experimental results vary randomly because the data is generated randomly.
We therefore use scatter plots, in which the means illustrate the performance.



Experimental Results. Consider Figure 3. The results of the separate method are
not influenced by the number of sections, so its mean stays constant apart from
small random changes. The precision and recall of the ANOVA-based method
improve as the number of sections grows. Moreover, the experimental results
illustrate that the cross-sectional property is strong even when there are just
several sections. Consider Figure 4. The shorter the time series, the more
useful the cross-sectional property. The ANOVA-based method is still better
even when the length is more than 50.

Typical cross-sectional time series, such as economic data, marketing data, or
scientific data, may contain tens of sections or more and have a short length.
So we can usually take advantage of the ANOVA-based event-distilling algorithm
on real-life data.



4.2 Real-Life Data 

The real-life data is the cross-province yearly macroeconomic time series of
China, which comprises 25 provinces and more than 150 attributes. The lengths
of the attribute series are between 12 and 18 because of missing values. We use
just 25 of the 31 provinces of China because 6 provinces have too many missing
values. There are still some missing values in the remaining 25 provinces, but
they do not harm our method greatly and only mean that some corresponding
events are lost.

Here we compare association rules rather than events, because we do not know
the right events in real-life data and a comparison of rules is easier to
understand than a comparison of events.

As with other technical analysis methods, whether a rule makes sense depends
on whether it is meaningful in economics. That is, our method can serve as an
auxiliary tool for economics.

Table 2 shows some rules with high support and confidence. We use only up,
down, UP, and DOWN, which are more meaningful than stable and STABLE in
economics.



Table 2. Association Rules Mined from Chinese Macroeconomic Data

No.  Rules with ANOVA-based Distilling                      Supp  Conf  Sect
1    Resident Consumption up => Total Consumption up        28%   92%   100%
2    Total Consumption up => Resident Consumption up        28%   91%   100%
3    GDP down => Total Retail down                          26%   79%   100%
4    Total Retail down => GDP down                          26%   77%   100%
5    Total Industrial Production UP => Heavy I. P. UP       26%   93%   100%
6    Heavy Industrial Production UP => Total I. P. UP       26%   91%   100%
7    Employers in State-Owned up => College Students down   24%   68%   100%
8    College Students down => Employers in State-Owned up   24%   67%   100%
9    Agricultural Production down => Sugar Output down      24%   87%   100%
10   Sugar Output down => Agricultural Production down      24%   91%   100%

No.  Rules with Separately Distilling                       Supp  Conf  Sect
11   Resident Consumption up => Total Consumption up        18%   87%   76%
12   Total Consumption up => Resident Consumption up        18%   85%   76%
13   Service Industry down => GDP down                      17%   76%   60%
14   GDP down => Service Industry down                      17%   73%   60%
15   Total Retail DOWN => GDP DOWN                          17%   70%   64%
16   GDP DOWN => Total Retail DOWN                          17%   69%   64%
17   Total Retail down => Government Payout down            17%   77%   52%
18   Government Payout down => Total Retail down            17%   79%   52%

Comment: Sect is short for Support of Sections.






Among the rules distilled with the ANOVA-based method, some are trivial or well
known. For example, the first two rules in Table 2 are well known, because
Resident Consumption is a major part of Total Consumption. Their confidences
are both greater than 90%. Rules No. 5 and No. 6 mean that an increase in the
rate of change of Total Industrial Production is determined by that of Heavy
Industrial Production, which is trivial. Rules No. 3 and No. 4 show that Total
Retail is important to GDP, which is trivial too.

Some rules are interesting, or even strange. For example, Rules No. 9 and
No. 10 mean that Sugar Output is important to Agricultural Production, which is
somewhat interesting. Rules No. 7 and No. 8 look strange. Their discussion can
be left to domain experts.

All rules distilled with the ANOVA-based method have a 100% section cover rate,
as well as relatively high support and confidence. In contrast, the rules
distilled separately cover only some sections and have relatively low support
and confidence. The reason is that the ANOVA-based distilling method distills
events much more accurately than the separate distilling method does.

5 Conclusion 

This paper aims to mine interesting rules from cross-sectional time series. The
main contribution of the paper is an ANOVA-based event-distilling algorithm,
which is able to distill events accurately from cross-sectional short-length
time series. Interesting and useful association rules can be mined from these
events. Event types are defined based on the relative differences of the time
series. We used experiments on synthetic data to illustrate the advantage of the
ANOVA-based method, and experiments on real-life data to present a concrete
illustration.
Abstract. We are designing new data mining techniques on boolean
contexts to identify a priori interesting concepts, i.e., closed sets of 
objects (or transactions) and associated closed sets of attributes (or 
items). We propose a new algorithm D-Miner for mining concepts under 
constraints. We provide an experimental comparison with previous 
algorithms and an application to an original microarray dataset for 
which D-Miner is the only one that can mine all the concepts. 

Keywords: Pattern discovery, constraint-based data mining, closed sets, 
formal concepts. 



1 Introduction 

One of the most popular data mining techniques concerns transactional data
analysis by means of set patterns. Indeed, following the seminal paper [1],
hundreds of research papers have considered the efficient computation of a
priori interesting association rules from the so-called frequent itemsets.
Transactional data can be represented as boolean matrices (see Figure 1). Lines
denote transactions and columns are boolean attributes that record item
occurrences. For instance, in Figure 1, one transaction contains the items ge,
g7, gs, g9, and g10.

The frequent set mining problem concerns the computation of sets of attributes
that are true together in enough transactions, i.e., given a frequency
threshold. The typical case of basket analysis (a huge number of transactions,
possibly millions, hundreds of attributes, but sparse and weakly correlated
data) can be handled by many algorithms, including the various Apriori-like
algorithms that have been designed during the last decade [2]. When the data
are dense and highly correlated, these algorithms fail but the so-called
condensed representations of the frequent itemsets can be computed. For
instance, efficient algorithms can compute the frequent closed sets from which
every frequent set and its frequency can be derived without accessing the data
[10,6,11,3,14]. Other important applications concern datasets with only a few
transactions, e.g., typical gene expression data where items denote gene
expression properties in biological situations. It is however possible to use
the properties of the Galois connection to compute the closed sets on the
smaller dimension and derive the closed sets on the other dimension [12].

Fig. 1. Example of a boolean context r1

In this paper, we consider bi-set mining in difficult cases, i.e., when the data 
is dense and when none of the dimensions is quite small. Bi-sets are composed 
of a set of lines T and a set of columns G. T and G can be associated by various 
relationships, e.g., the fact that all the items of G belong to each transaction of 
T (1-rectangles). It is interesting to constrain further bi-set components to be 
closed sets (also called maximal 1-rectangles or concepts [13]). Other constraints, 
e.g., minimal and maximal frequency, can be used as well. 

We propose an original algorithm called D-Miner that computes concepts under
constraints. It works differently from other concept discovery algorithms (see,
e.g., [8,4,9]) and (frequent) closed set computation algorithms. D-Miner can
be used on dense boolean datasets on which the previous algorithms generally
fail. Thanks to an active use of the constraints, it enlarges the applicability
of concept discovery to matrices in which neither dimension is small. Section 2
contains the needed definitions and a presentation of D-Miner. Section 3
provides an experimental validation. Finally, Section 4 is a short conclusion.



2 D-Miner 

Let $\mathcal{O}$ denote a set of objects or transactions and $\mathcal{P}$
denote a set of items or properties. In Figure 1,
$\mathcal{O} = \{t_1, \ldots, t_n\}$ and
$\mathcal{P} = \{g_1, g_2, \ldots, g_{10}\}$. The transactional data is
represented by the matrix $r$ of relation
$R \subseteq \mathcal{O} \times \mathcal{P}$. We write $(t_i, g_j) \in r$ to
denote that item $j$ belongs to transaction $i$ or that property $j$ holds for
object $i$.

The language of bi-sets is the collection of couples from
$\mathcal{L}_\mathcal{O} \times \mathcal{L}_\mathcal{P}$ where
$\mathcal{L}_\mathcal{O} = 2^{\mathcal{O}}$ (sets of objects) and
$\mathcal{L}_\mathcal{P} = 2^{\mathcal{P}}$ (sets of items).

Definition 1. A bi-set $(T,G)$ is a 1-rectangle in $r$ iff $\forall t \in T$
and $\forall g \in G$ then $(t,g) \in r$. A bi-set $(T,G)$ is a 0-rectangle in
$r$ iff $\forall t \in T$ and $\forall g \in G$ then $(t,g) \notin r$.
Definition 2. (Concept) A bi-set $(T,G)$ is a concept in $r$ iff $(T,G)$ is a
1-rectangle and $\forall T' \subseteq \mathcal{O} \setminus T$,
$T' \neq \emptyset$, $(T \cup T', G)$ is not a 1-rectangle and
$\forall G' \subseteq \mathcal{P} \setminus G$, $G' \neq \emptyset$,
$(T, G \cup G')$ is not a 1-rectangle.

Notice that, by construction, both sets of a concept are closed sets and any
algorithm that computes closed sets can be used for concept discovery [12].
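Definitions 1 and 2 can be checked directly on a small context. The sketch below is illustrative only: the context `r`, stored as a dictionary from objects to the items that are true for them, is a hypothetical toy example, and the function names are not from the paper. Maximality is tested by trying to add one object or one item at a time, which is sufficient because any subset of a 1-rectangle is again a 1-rectangle.

```python
from typing import Dict, Set

Context = Dict[str, Set[str]]  # object -> items that are true for it


def is_one_rectangle(r: Context, T: Set[str], G: Set[str]) -> bool:
    """Every object of T is in relation with every item of G."""
    return all(G <= r[t] for t in T)


def is_concept(r: Context, items: Set[str], T: Set[str], G: Set[str]) -> bool:
    """A 1-rectangle that cannot be extended by any object or any item."""
    if not is_one_rectangle(r, T, G):
        return False
    no_more_objects = all(not (G <= r[t]) for t in set(r) - T)
    no_more_items = all(not all(g in r[t] for t in T) for g in items - G)
    return no_more_objects and no_more_items


# Hypothetical toy context with three objects and three items.
items = {"g1", "g2", "g3"}
r = {"t1": {"g3"}, "t2": {"g1", "g3"}, "t3": {"g3"}}

print(is_one_rectangle(r, {"t2", "t3"}, {"g3"}))           # a 1-rectangle ...
print(is_concept(r, items, {"t2", "t3"}, {"g3"}))          # ... that t1 can extend
print(is_concept(r, items, {"t1", "t2", "t3"}, {"g3"}))    # maximal on both sides
```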

Given Figure 1, ({G, ^2, ^s}, {<?i, 52}) is a 1-rectangle in ri but it is not a con- 
cept. Twelve bi-sets are concepts in ri. Two of them are ({ti, ^2, Gj G}: {91^93}) 
and ({G, G}j {91,92, 93, 94, 99, 9 io})- Interesting data mining processes on trans- 
actional data can be formalized as the computation of bi-sets whose set compo- 
nents satisfy combinations of primitive constraints. 

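Definition 3. (Monotonic and anti-monotonic constraints) Given $\mathcal{L}$ a
collection of sets, a constraint $C$ is said anti-monotonic w.r.t. $\subseteq$
iff $\forall \alpha, \beta \in \mathcal{L}$ such that $\alpha \subseteq \beta$,
$C(\beta) \Rightarrow C(\alpha)$. $C$ is said monotonic w.r.t. $\subseteq$ iff
$\forall \alpha, \beta \in \mathcal{L}$ such that $\alpha \subseteq \beta$,
$C(\alpha) \Rightarrow C(\beta)$.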

In Apriori-like algorithms, the minimal frequency constraint (on
$\mathcal{L}_\mathcal{P}$) is used to prune the search space. This constraint
is anti-monotonic w.r.t. $\subseteq$ on $\mathcal{L}_\mathcal{P}$. This
constraint can be considered as monotonic on $\mathcal{L}_\mathcal{O}$ because
when a set of items is larger, the associated set of transactions is smaller.

Definition 4. (Specialization relation) Our specialization relation on bi-sets
from $\mathcal{L} = \mathcal{L}_\mathcal{O} \times \mathcal{L}_\mathcal{P}$ is
defined by $(T_1, G_1) \leq (T_2, G_2)$ iff $T_1 \subseteq T_2$ and
$G_1 \subseteq G_2$.

We generalize the frequency constraints on this partial order $\leq$.

Definition 5. (Frequency constraints on concepts) A concept $(T,G)$ satisfies a
constraint $C_t(r, \sigma_1, T)$ (resp. $C_g(r, \sigma_2, G)$) if
$|T| \geq \sigma_1$ (resp. $|G| \geq \sigma_2$). These constraints are both
monotonic w.r.t. $\leq$ on $\mathcal{L}_\mathcal{O} \times \mathcal{L}_\mathcal{P}$.

For example, the set of concepts (T,G) satisfying Cg(ri , 4, G) A Ct(ri , 3, T) (a 
conjonction of monotonic constraints) is {{{9i,92,93,94},{ti,t2,tf\)}. 

2.1 D-Miner Principle 

D-Miner is a new algorithm for extracting concepts $(T, G)$ under constraints.
It builds the sets $T$ and $G$ and it uses monotonic constraints simultaneously
on $\mathcal{L}_\mathcal{O}$ and $\mathcal{L}_\mathcal{P}$ to reduce the search
space. A concept $(T, G)$ is such that all its items and objects are in
relation by $R$. Thus, the absence of relation between an item $g$ and an
object $t$ generates two concepts, one with $g$ and without $t$, and another
one with $t$ and without $g$. D-Miner is based on this observation.

Let us denote by $H$ a set of 0-rectangles that is a partition of the false
values (0) of the boolean matrix, i.e., $\forall g \in \mathcal{P}$ and
$\forall t \in \mathcal{O}$ such that $(t,g) \notin r$, there exists one and
only one element $(X,Y)$ of $H$ such that $t \in X$ and $g \in Y$. The elements
of $H$ are called cutters. $H$ must be as small as possible to reduce the depth
of recursion and thus execution time. On the other hand, one should not waste
too much time computing $H$. $H$ contains as many elements as there are lines
in the matrix. Each element is composed of the attributes valued by 0 in this
line. The time complexity of computing $H$ is in $O(n \times m)$ where $n$ and
$m$ are the dimensions of the matrix. This computing time is negligible w.r.t.
that of the cutting procedure. Furthermore, using this definition makes it
easier to prune the 1-rectangles that are not concepts.
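The line-by-line construction of the cutters can be sketched as follows. This is a minimal illustration with hypothetical names: the context is stored as a dictionary from objects to their true items, and lines that contain no 0 are skipped, which is an assumption since such a line has no false value to cover.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Cutter = Tuple[FrozenSet[str], FrozenSet[str]]


def compute_cutters(r: Dict[str, Set[str]], items: Set[str]) -> List[Cutter]:
    """One cutter per line: ({t}, items valued 0 on the line of t)."""
    H: List[Cutter] = []
    for t in sorted(r):
        zeros = frozenset(items - r[t])
        if zeros:  # assumption: a line without any 0 yields no cutter
            H.append((frozenset({t}), zeros))
    return H


# Hypothetical toy context with three objects and three items.
items = {"g1", "g2", "g3"}
r = {"t1": {"g3"}, "t2": {"g1", "g3"}, "t3": {"g3"}}
H = compute_cutters(r, items)
for a, b in H:
    print(sorted(a), sorted(b))
```

Each false cell of the matrix is covered by exactly one cutter, the one built from its own line, so the result is a partition of the 0 values as required.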

D-Miner starts with the couple $(\mathcal{O}, \mathcal{P})$ and then splits it
recursively using the elements of $H$ until $H$ is empty and consequently each
couple is a 1-rectangle. An element $(a, b)$ of $H$ is used to cut a couple
$(X, Y)$ if $a \cap X \neq \emptyset$ and $b \cap Y \neq \emptyset$. By
convention, one defines the left son of $(X, Y)$ by $(X \setminus a, Y)$ and
the right son by $(X, Y \setminus b)$. Recursive splitting leads to all the
concepts, i.e., the maximal 1-rectangles (see Example 1) but also some
non-maximal ones (see Example 2). We consider in Example 3 how to prune them to
obtain all the concepts and only the concepts.

We now provide examples of D-Miner executions. The use of monotonic
constraints on $\mathcal{L}_\mathcal{O}$ and $\mathcal{L}_\mathcal{P}$ is
presented later. Notice that for clarity, sets like $\{g_1, g_2\}$ are written
$g_1g_2$.



Example 1. Assume $\mathcal{O} = \{t_1,t_2,t_3\}$ and
$\mathcal{P} = \{g_1,g_2,g_3\}$. $r_2$ is defined in Table 1 (left). Figure 2
(left) illustrates D-Miner execution. We get four 1-rectangles that are the 4
concepts for this boolean context.

Table 1. Contexts $r_2$ for Example 1 (left) and $r_3$ for Examples 2 and 3
(right)

Example 2. Assume now $r_3$ as given in Table 1 (right). Computing $H$ provides
$\{(t_1, g_1g_2), (t_2, g_2), (t_3, g_1g_2)\}$. Figure 2 (right) illustrates
D-Miner execution. Some bi-sets are underlined and this will be explained in
Example 3.

From Figure 2 (right), we can see that $(t_2, g_2)$ and $(t_3, g_1g_2)$ from
$H$ are not used to cut $(t_1t_2t_3, g_3)$ because
$\{g_2\} \cap \{g_3\} = \emptyset$ and $\{g_1g_2\} \cap \{g_3\} = \emptyset$.
The computed collection of bi-sets is:

$\{(t_1t_2t_3, g_3), (t_2, g_1g_3), (\emptyset, g_1g_2g_3), (t_3, g_3), (t_2t_3, g_3)\}$

We see that $(t_3, g_3) \leq (t_1t_2t_3, g_3)$ and
$(t_2t_3, g_3) \leq (t_1t_2t_3, g_3)$ and thus these 1-rectangles are not
concepts.

To solve this problem, let us introduce a new notation. Let $r[T, G]$ denote
the reduction of $r$ on objects from $T$ and on items from $G$. When a couple
$(X, Y)$ is split by a cutter $(a, b) \in H$, then $(X \setminus a, Y)$ (the
left son) and $(X, Y \setminus b)$ (the right son) are generated. By
construction of $H$, $(a, Y \setminus b)$ is a 1-rectangle which is not
necessarily maximal. If a concept $(C_X, C_Y)$ exists in $r[X \setminus a, Y]$
such that $C_Y \cap b = \emptyset$, then $(C_X \cup a, C_Y)$ is a concept in
$r[X, Y]$. However, $(C_X \cup a, C_Y)$ is a concept in $r[X, Y \setminus b]$
and consequently would be a son of the right son of $(X, Y)$ (see Figure 3). To
avoid these non-maximal 1-rectangles, we have to enforce that the property
$b \cap Y \neq \emptyset$ is always satisfied for all the previously used
left-cutters.

Fig. 2. Concept construction on $r_2$ (left) and on $r_3$ (right)





Fig. 3. Non-maximal 1-rectangle occurrence



Property 1: Let $(X,Y)$ be a leaf of the tree and $H_L(X,Y)$ be the set of
cutters associated to the left branches of the path from the root to $(X, Y)$.
Then $(X, Y)$ is a concept iff $Y$ contains at least one item of each element
of $H_L(X, Y)$. It means that when trying to build a right son $(X,Y)$ (i.e.,
to remove some elements from $Y$), we must check that
$\forall (a,b) \in H_L(X,Y)$, $b \cap Y \neq \emptyset$. This is called later
the left cutting constraint.

This has been formally studied in [5], which contains correctness and
completeness proofs for D-Miner.



Example 3. We take the context used for Example 2 (see Table 1 on the right).
1-rectangles $(t_3, g_3)$ and $(t_2t_3, g_3)$ are pruned using Property 1.
$(t_3, g_3)$ comes from the left cutting of $(t_1t_2t_3, g_1g_2g_3)$ and then
the left cutting of $(t_2t_3, g_1g_2g_3)$. The items of $(t_3, g_3)$ must
contain at least one item of $\{g_1, g_2\}$ and of $\{g_2\}$, i.e., the
precedent left cutter sets of items. It is not the case and thus $(t_3, g_3)$
is pruned. $(t_2t_3, g_3)$ comes from just one left cutter: $(t_1, g_1g_2)$. It
contains neither $g_1$ nor $g_2$. Nodes that are underlined in Figure 2 (right)
are pruned.



2.2 Algorithm 

Before cutting a couple $(X, Y)$ by a cutter $(a, b)$ into two couples
$(X \setminus a, Y)$ and $(X, Y \setminus b)$, two types of constraints must be
checked, first the monotonic constraints and then the left cutting constraint.
The closedness property (maximality) is implied by the cutting procedure.

D-Miner is a depth-first method which generates couples ordered by relation
$\leq$. Monotonic constraints w.r.t. either $\mathcal{O}$ or $\mathcal{P}$ are
used to prune the search space: if $(X, Y)$ does not satisfy a monotonic
constraint $C$ then none of its sons satisfies $C$ and it is unnecessary to cut
$(X,Y)$. For instance, we can push the constraint
$C((T, G)) = C_t(T) \wedge C_g(G)$ where $(T,G)$ is a bi-set,
$C_t(T) \equiv |T| \geq 5$, and $C_g(G) \equiv |G| \geq 4$.

Algorithms 1 and 2 contain the pseudo-code of D-Miner. First, the set H 
of cutters is computed. Then the recursive function cutting() is called. 

Function cutting cuts out a couple $(X,Y)$ with the first cutter $H[i]$ that
satisfies the following constraints. First, $(X, Y)$ must have a non-empty
intersection with $H[i]$. If it is not the case, cutting is called with the
next cutter. Before cutting $(X, Y)$ in $(X \setminus a, Y)$, we have to check
the monotonic constraint on $X \setminus a$ (denoted $C_t(X \setminus a)$) to
try to prune the search space. $(a, b)$ is inserted into $H_L$, the set of
cutters in the left cutting. Then cutting is called on $(X \setminus a, Y)$ and
$(a, b)$ is removed from $H_L$. For the second cutting of $(X, Y)$, two
constraints have to be checked. First, the monotonic constraint on
$Y \setminus b$ (denoted $C_g(Y \setminus b)$) is checked. Second, the left
cutting constraint is checked on $Y \setminus b$ for every cutter in $H_L$.
Thus, D-Miner first constructs an element $(X, Y)$ and then reduces $X$ and $Y$
simultaneously to obtain the collection of concepts derived from $(X, Y)$.
Monotonic constraints can be applied on $X$ and $Y$ to prune the search space:
if $\alpha \leq \beta$ and $\neg C(\beta)$ then $\neg C(\alpha)$.

It is possible to optimize this algorithm. First, the order of the elements
of $H$ is important. The aim is to cut as soon as possible the branches which
generate non-maximal 1-rectangles. $H$ must be sorted by decreasing order of
size of the object components. Moreover, to reduce the size of $H$, the cutters
which have the same items are gathered: $\forall (a_1,b_1), (a_2,b_2) \in H$,
if $b_1 = b_2$ then
$H = H \setminus \{(a_1,b_1), (a_2,b_2)\} \cup \{(a_1 \cup a_2, b_1)\}$. If
$|\mathcal{P}| > |\mathcal{O}|$, we transpose the data matrix to obtain a set
$H$ of minimum size. The symmetry of our extractor on $\mathcal{L}_\mathcal{O}$
and $\mathcal{L}_\mathcal{P}$ allows us to transpose the matrix without losing
the possibility of using the constraints. Indeed, in some contexts where there
are few objects and many items, we first perform a simple transposition like in
[12].
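The gathering and ordering of cutters described above can be sketched as follows. The function name `merge_cutters` and the input list are hypothetical; the sketch groups cutters by their item component, unions the object components, and then sorts by decreasing size of the object component.

```python
from typing import Dict, FrozenSet, List, Tuple

Cutter = Tuple[FrozenSet[str], FrozenSet[str]]


def merge_cutters(H: List[Cutter]) -> List[Cutter]:
    """Gather cutters that share the same items, then sort by object-set size."""
    merged: Dict[FrozenSet[str], FrozenSet[str]] = {}
    for a, b in H:
        merged[b] = merged.get(b, frozenset()) | a   # union of object components
    result = [(a, b) for b, a in merged.items()]
    result.sort(key=lambda cutter: len(cutter[0]), reverse=True)
    return result


# Hypothetical cutters: two of them share the item set {g1, g2}.
H = [
    (frozenset({"t1"}), frozenset({"g1", "g2"})),
    (frozenset({"t2"}), frozenset({"g2"})),
    (frozenset({"t3"}), frozenset({"g1", "g2"})),
]
merged_H = merge_cutters(H)
for a, b in merged_H:
    print(sorted(a), sorted(b))
```

The merged set is still a partition of the false values, because each merged element remains a 0-rectangle covering the same cells.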
Algorithm 1: D-Miner

Input: Database r with n lines and m columns, O the set of objects, P
    the set of items, Ct and Cg are monotonic constraints on O and P.
Output: Q the set of concepts that satisfy Ct and Cg
HL <- empty()
H and Hsize = |H| are computed from r
Q <- cutting((O, P), H, 0, Hsize, HL)

Algorithm 2: cutting

Input: (X, Y) a couple of 2^O x 2^P, H the list of cutters, i the number
    of iterations, Hsize the size of H, HL a set of precedent cutters in left
    cuttings, Ct monotonic constraint on O, Cg monotonic constraint on P.
Output: Q the set of concepts that satisfy Ct and Cg
(a, b) <- H[i]
If (i <= Hsize - 1)    // i-th cutter is selected
    If ((a ∩ X = ∅) or (b ∩ Y = ∅))
        Q <- Q ∪ cutting((X, Y), H, i + 1, Hsize, HL)
    Else
        If (Ct(X\a) is satisfied)
            HL <- HL ∪ (a, b)
            Q <- Q ∪ cutting((X\a, Y), H, i + 1, Hsize, HL)
            HL <- HL \ (a, b)
        If (Cg(Y\b) is satisfied ∧ ∀(a', b') ∈ HL, b' ∩ Y\b ≠ ∅)
            Q <- Q ∪ cutting((X, Y\b), H, i + 1, Hsize, HL)
Else
    Q <- (X, Y)
Return Q
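For readers who prefer runnable code, the two algorithms can be sketched in Python roughly as follows. This is an illustrative reading of the pseudo-code, not the authors' implementation: the names `d_miner`, `c_t` and `c_g` are hypothetical, the context is a dictionary from objects to their true items, the optimizations on the order and merging of cutters are left out, and the toy context is an assumption chosen so that its cutters match those listed in Example 2.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

BiSet = Tuple[FrozenSet[str], FrozenSet[str]]


def d_miner(r: Dict[str, Set[str]], items: Set[str],
            c_t: Callable[[FrozenSet[str]], bool] = lambda T: True,
            c_g: Callable[[FrozenSet[str]], bool] = lambda G: True) -> Set[BiSet]:
    """Concepts of r that satisfy the monotonic constraints c_t and c_g."""
    # Cutters: one per line, holding the items valued 0 on that line.
    H: List[BiSet] = [(frozenset({t}), frozenset(items - r[t]))
                      for t in sorted(r) if items - r[t]]

    def cutting(X: FrozenSet[str], Y: FrozenSet[str], i: int,
                HL: List[BiSet]) -> Set[BiSet]:
        if i == len(H):                       # no cutter left: (X, Y) is a leaf
            return {(X, Y)}
        a, b = H[i]
        if not (a & X) or not (b & Y):        # cutter does not apply, try the next
            return cutting(X, Y, i + 1, HL)
        Q: Set[BiSet] = set()
        if c_t(X - a):                        # left son, (a, b) joins the left cutters
            Q |= cutting(X - a, Y, i + 1, HL + [(a, b)])
        # Right son: monotonic constraint plus the left cutting constraint.
        if c_g(Y - b) and all(b2 & (Y - b) for _, b2 in HL):
            Q |= cutting(X, Y - b, i + 1, HL)
        return Q

    return cutting(frozenset(r), frozenset(items), 0, [])


items = {"g1", "g2", "g3"}
r3 = {"t1": {"g3"}, "t2": {"g1", "g3"}, "t3": {"g3"}}
concepts = d_miner(r3, items)
for T, G in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[1]))):
    print(sorted(T), sorted(G))
```

Passing `HL + [(a, b)]` to the left call replaces the explicit insertion and removal of the cutter in the pseudo-code, since the extended list is only visible inside that recursive call.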



3 Experimental Validation 



We compare the execution time of D-Miner with those of Closet [11], Ac-Miner
[6] and Charm [14] on three datasets. Closet, Charm and Ac-Miner compute
closed sets under a minimal frequency constraint, i.e., the frequent closed
sets. Due to the Galois connection, in a given dataset, the number of closed
sets in $\mathcal{L}_\mathcal{O}$ is the number of closed sets in
$\mathcal{L}_\mathcal{P}$. For a fair comparison, we transpose the matrices to
have a smaller number of columns for Closet, Ac-Miner and Charm and to have a
smaller number of lines for D-Miner. In the first three experiments, we compare
the effectiveness of the four algorithms when computing the collection of
closed sets under a minimal frequency constraint. We used Zaki's implementation
of Charm and Bykowski's implementations of Ac-Miner and Closet [7]. In all the
following figures, the minimal frequencies are relative frequencies.

First, we have studied the performance of D-Miner for computing the frequent
closed sets from benchmark datasets available online at IBM Almaden^1 and the
UCI repository. All extractions have been performed on a Pentium III (450 MHz,
128 MB). We have used the benchmark "Mushroom". Its derived boolean context
contains 8 124 lines and 120 columns. The execution time (in seconds) needed to
obtain the frequent closed sets is shown in Figure 4 (left).




Fig. 4. Mushroom (left) and Connect4 (right) 



The four algorithms can compute every closed set and thus all concepts 
(minimum frequency 0.0001) within a few minutes. Indeed, the lowest frequency 
threshold corresponds to at least 1 object. Once every closed set on one dimension 
is computed, the associated concept is obtained easily. The execution time of 
Closet increases very fast compared to the three others. 

Next, we considered the benchmark "Connect4". The derived boolean context
contains 67 557 lines and 149 columns. The execution time to obtain frequent
closed sets is shown in Figure 4 (right). Only Charm and D-Miner can extract
concepts with a minimal frequency equal to 0.1 (10%). D-Miner is almost twice
as fast as Charm on this dataset.

We now provide an experimental validation on an original biological dataset
that contains 104 lines and 304 columns. For the purpose of this paper, an
important piece of information is that its density is high: 17% of the cells
contain the value true. This is a gene expression dataset that cannot be
described further due to space limitations (see [5] for details). The execution
time (in seconds) for computing frequent closed sets with Closet, Ac-Miner,
Charm and D-Miner is shown in Figure 5 (left).

D-Miner is the only algorithm which succeeds in extracting all the concepts.
These data are in fact very particular: there are very few concepts until the
frequency threshold drops to 0.1 (5 534 concepts) and then the number of
concepts increases very fast (at the lowest frequency threshold, there are more
than 5 million concepts). In this context, extracting putative interesting
concepts requires a very low frequency threshold, otherwise almost no concept
is provided. Consequently D-Miner is much better than the other algorithms
because it succeeds in extracting concepts when the frequency threshold is
lower than 0.06 whereas it is impossible with the others.

Fig. 5. A microarray dataset analysis

^1 See www.almaden.ibm.com/cs/quest/demos.html.

End users, e.g., biologists, are generally interested in a rather small subset
of the extracted collections. These subsets can be specified by means of
user-defined constraints that can be checked afterwards (post-processing) on
the whole collection of concepts. It is clear however that this approach leads
to tedious or even impossible post-processing phases when, e.g., more than 5
million concepts are computed (almost 500 MB of patterns). It clearly motivates
the need for constraint-based mining of concepts.

Let us consider the computation of the bi-sets that satisfy two minimal
frequency constraints on $\mathcal{L}_\mathcal{O}$ and
$\mathcal{L}_\mathcal{P}$: one on item (gene) sets and the other one on objects
(biological situations). In Figure 5 (right), we plot the number of concepts
obtained using both $C_t(r, \sigma_1, T)$ and $C_g(r, \sigma_2, G)$ when
$\sigma_1$ and $\sigma_2$ vary. It appears that using only one of the two
constraints does not reduce significantly the number of extracted concepts (see
values when $\sigma_1 = 0$ or $\sigma_2 = 0$). However, when we use the two
constraints simultaneously, the size of the concept collection decreases
strongly (the surface of the values forms a basin). For example, the number of
concepts verifying $|G| \geq 10$ and $|T| \geq 21$ is 142 279. The number of
concepts verifying $|G| \geq 10$ and $|T| \geq 0$ is 5 422 514. The number of
concepts verifying $|G| \geq 0$ and $|T| \geq 21$ is 208 746. The gain when
using both constraints simultaneously is significant.

4 Conclusion 

Computing formal concepts has been proved useful in many application domains
but remains extremely hard on dense boolean datasets like the ones we have to
process nowadays for gene expression data analysis. We have described an
original algorithm that computes concepts under monotonic constraints. First,
it can be used for closed set computation and thus concept discovery. Next, for
difficult contexts, i.e., dense boolean matrices where none of the dimensions
is small, the analyst can provide monotonic constraints on both set components
of the desired concepts and the D-Miner algorithm can push them into the
extraction process. Considering one of our applications that is described in
[5], we are now working on the biological validation of the extracted concepts.
We also have to compare D-Miner with new concept lattice construction
algorithms like [4].
Abstract. Inductive queries are queries to an inductive database that
generate a set of patterns in a data mining context. Inductive querying
poses new challenges to database and data mining technology. We study
conjunctive inductive queries, which are queries that can be written as
a conjunction of a monotonic and an anti-monotonic subquery. We
introduce the conjunctive inductive query optimization problem, which is
concerned with minimizing the cost of computing the answer set to a
conjunctive query. In the optimization problem, it is assumed that there
are costs $c_m$ and $c_a$ associated to evaluating a pattern w.r.t. a
monotonic and an anti-monotonic subquery respectively. The aim is then to
minimize the total cost needed to compute all solutions to the query.
Secondly, we present an algorithm that aims at optimizing conjunctive
inductive queries in the context of the pattern domain of strings and
evaluate it on a challenging data set in computational biology.



1 Introduction 

Many data mining problems address the problem of finding a set of patterns that
satisfy a constraint. Formally, this can be described as the task of finding the
set of patterns
$Th(Q, \mathcal{D}, \mathcal{L}) = \{\phi \in \mathcal{L} \mid Q(\phi, \mathcal{D})\}$,
i.e., those patterns $\phi$ satisfying query $Q$ on database $\mathcal{D}$. Here
$\mathcal{L}$ is the language in which the patterns or rules are described and
$Q$ is a predicate or constraint that determines whether a pattern is a solution
to the data mining task or not [19]. This framework allows us to view the
predicate or the constraint $Q$ as an inductive query to an inductive database
system. It is then the task of the inductive database management system to
efficiently generate the answers to the query [7]. Within this framework data
mining becomes an (inductive) querying process that puts data mining on the
same methodological grounds as databases. This view of data mining raises
several new challenges for database and data mining technology.

* An early version of this paper was presented at the 2nd ECML/PKDD Workshop 
on Knowledge Discovery with Inductive Querying, Dubrovnik, 2003. 

** Now at the Institut für Informatik, Ludwig-Maximilians-Universität München

In this paper, we address one of these challenges, the optimization of
conjunctive inductive queries. These queries can be written as the conjunction
$Q_a \wedge Q_m$ of an anti-monotonic and a monotonic subquery. An example query
could ask for molecular fragments that have frequency at least 30 per cent in
the active molecules and frequency at most 5 per cent in the inactive ones
[5,18]. Conjunctive inductive queries of this type have been studied in various
contexts, cf. [5,6,18,4]. One important result is that their solution space
$Th(Q, \mathcal{D}, \mathcal{L})$ is convex, which is related to the well-known
concepts of version spaces [20] and boundary sets [19], a fact that is exploited
by several pattern mining algorithms. The key contribution of this paper is that
we introduce an algorithm for computing the set of solutions
$Th(Q, \mathcal{D}, \mathcal{L})$ to a conjunctive inductive query that aims at
minimizing the cost of evaluating patterns w.r.t. the primitive constraints in
the inductive query. More precisely, we assume that there are costs $C_m$ and
$C_a$ associated with testing whether a pattern satisfies $Q_m$ resp. $Q_a$, and
we aim at minimizing the total cost of computing
$Th(Q, \mathcal{D}, \mathcal{L})$. The algorithm that we introduce builds upon
the work of [6], which introduced an effective data structure, called the
version space tree, and algorithms for computing
$Th(Q, \mathcal{D}, \mathcal{L})$ in a levelwise manner for string patterns. In
the present paper, we modify this data structure into the partial version space
tree and we also present an entirely different approach to computing the answer
set to a conjunctive inductive query. Even though the algorithm and data
structure are presented in the context of string patterns and data, we believe
that the principles also apply to other pattern domains. The approach is also
empirically evaluated on the task of finding patterns in a computational biology
data set.

The paper is organized as follows. Section 2 introduces the problem of
conjunctive query optimization. Section 3 presents a data structure and
algorithm for tackling it. Section 4 reports on an experimental evaluation and,
finally, Sect. 5 concludes and touches upon related work.

2 Conjunctive Inductive Queries 

In this section, we define conjunctive inductive queries as well as the pattern 
domain of strings. Our presentation closely follows that of [6]. 

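A pattern language $\mathcal{L}$ is a formal language for specifying patterns.
Each pattern $\phi \in \mathcal{L}$ matches (or covers) a set of examples
$\phi_e$, which is a subset of the universe $\mathcal{U}$ of possible examples.
One pattern $\phi$ is more general than a pattern $\psi$, written
$\phi \succeq \psi$, if and only if $\phi_e \supseteq \psi_e$. E.g., let
$\Sigma$ be a finite alphabet, let $\mathcal{U}_\Sigma = \Sigma^*$ be the
universe of all strings over $\Sigma$, and denote the empty string by
$\epsilon$. The traditional pattern language in this domain is
$\mathcal{L}_\Sigma = \mathcal{U}_\Sigma$. A pattern
$\phi \in \mathcal{L}_\Sigma$ covers the set
$\phi_e = \{\psi \in \Sigma^* \mid \phi \sqsubseteq \psi\}$, where
$\phi \sqsubseteq \psi$ denotes that $\phi$ is a substring of $\psi$. For this
language, $\phi \succeq \psi$ if and only if $\phi \sqsubseteq \psi$.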
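
The string pattern domain can be illustrated with a minimal Python sketch. The
function names (`covers`, `is_more_general`, `cover_set`) and the finite sample
standing in for the infinite universe are assumptions made for illustration;
they do not come from the paper.

```python
def covers(pattern: str, example: str) -> bool:
    """A string pattern covers an example iff it occurs in it as a substring."""
    return pattern in example


def is_more_general(phi: str, psi: str) -> bool:
    """For string patterns, phi is more general than psi iff phi is a substring of psi."""
    return phi in psi


def cover_set(pattern: str, universe: list[str]) -> set[str]:
    """Cover set of a pattern, restricted to a finite sample of the universe."""
    return {example for example in universe if covers(pattern, example)}


# Hypothetical finite sample of the universe of strings.
sample = ["abc", "bcd", "cab", "xyz"]
general = cover_set("ab", sample)    # examples containing "ab"
specific = cover_set("abc", sample)  # examples containing "abc"
print(general, specific)
```

On this sample the more general pattern covers a superset of what the more
specific one covers, which is the set-inclusion reading of the generality
relation.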

A pattern predicate defines a primitive property of a pattern, usually relative 
to some data set T> (a set of examples), and sometimes other parameters. For 
any given pattern, it evaluates to either true or false. 

Inspired by the domain specific inductive database MolFea [18], we now introduce
a number of pattern predicates that will be used throughout this paper:

— minfreq($\phi$, $n$, $\mathcal{D}$) evaluates to true iff $\phi$ is a pattern
that occurs in database $\mathcal{D}$ with frequency at least
$n \in \mathbb{N}$. The frequency $f(\phi, \mathcal{D})$ of a pattern $\phi$ in
a database $\mathcal{D}$ is the (absolute) number of examples in $\mathcal{D}$
covered by $\phi$. Analogously, the predicate
maxfreq($\phi$, $n$, $\mathcal{D}$) is defined.

— ismoregeneral($\phi$, $\psi$) is a predicate that evaluates to true iff
pattern $\phi$ is more general than pattern $\psi$.

We say that $m$ is a monotonic predicate, if for all possible parameter values
$params$ and data sets $\mathcal{D}$:
$\forall \phi, \psi \in \mathcal{L}$ such that $\phi \succeq \psi$:
$m(\psi, \mathcal{D}, params) \rightarrow m(\phi, \mathcal{D}, params)$.
The class of anti-monotonic predicates is defined dually. Thus, minfreq and
ismoregeneral are monotonic; their duals are anti-monotonic.

A pattern predicate $pred(\phi, \mathcal{D}, params)$ defines the solution set
$Th(pred(\phi, \mathcal{D}, params), \mathcal{L}) = \{\psi \in \mathcal{L} \mid pred(\psi, \mathcal{D}, params) = true\}$.
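
As a small illustration of these predicates, the following Python sketch
computes the solution set of a minfreq query by brute force over all strings up
to a bounded length. The length bound, the helper name `theory` and the toy
database are assumptions for illustration only; the paper's pattern language is
unbounded and its algorithm avoids exhaustive enumeration.

```python
from itertools import product


def frequency(phi: str, db: list[str]) -> int:
    """Absolute number of examples in db that contain phi as a substring."""
    return sum(1 for example in db if phi in example)


def minfreq(phi: str, n: int, db: list[str]) -> bool:
    return frequency(phi, db) >= n


def maxfreq(phi: str, n: int, db: list[str]) -> bool:
    return frequency(phi, db) <= n


def theory(pred, alphabet: str, max_len: int) -> set[str]:
    """Solution set of pred, restricted to patterns of length <= max_len."""
    patterns = ("".join(p) for k in range(max_len + 1)
                for p in product(alphabet, repeat=k))
    return {phi for phi in patterns if pred(phi)}


# Hypothetical toy database over the alphabet {a, b, c, d}.
db = ["abc", "abd", "bcd"]
frequent = theory(lambda phi: minfreq(phi, 2, db), "abcd", 3)
print(sorted(frequent, key=lambda s: (len(s), s)))
```

Under the paper's convention, minfreq is monotonic: every substring of a pattern
in `frequent` is itself in `frequent`.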

We are interested in computing solution sets $Th(Q, \mathcal{D}, \mathcal{L})$
for conjunctive queries $Q$, i.e. boolean queries $Q$ that can be written as a
conjunction $Q_a \wedge Q_m$ of an anti-monotonic and a monotonic predicate.
Observe that in a conjunctive query $Q_a \wedge Q_m$, $Q_a$ and $Q_m$ need not
be atomic expressions. Indeed, it is well-known that both the disjunction and
conjunction of two monotonic (resp. anti-monotonic) predicates are monotonic
(resp. anti-monotonic). Furthermore, the negation of a monotonic predicate is
anti-monotonic and vice versa. We will also assume that there are cost functions
$C_a$ and $C_m$ associated with the anti-monotonic and monotonic subqueries
$Q_a$ and $Q_m$. The idea is that the cost functions reflect the (expected)
costs of evaluating the query on a pattern. E.g., $C_a(\phi)$ denotes the
expected cost needed to evaluate the anti-monotonic query $Q_a$ on the pattern
$\phi$. The present paper does not contribute specific concrete cost functions
but rather an overall and general framework for working with such cost
functions. Even though it is clear that some predicates are more expensive than
others, more work seems needed in order to obtain cost estimates that are as
reliable as in traditional databases. It is worth mentioning that several of the
traditional pattern mining algorithms, such as Agrawal et al.'s Apriori [2] and
the levelwise algorithm [19], try to minimize the number of passes through the
database. Even though this could also be cast within the present framework, the
cost functions introduced below better fit the situation where one attempts to
minimize the number of covers tests, i.e. the number of tests whether a pattern
matches or covers a given example. Even though for simple representation
languages such as item sets and strings this covers test can often be evaluated
efficiently, there exist important applications, such as mining graphs and
molecules [18,17], where covers testing corresponds to the subgraph isomorphism
problem, which is known to be NP-complete. When mining such data, it is more
important to minimize the number of covers tests than to minimize the number of
passes through the data. Furthermore, for applications in mining graphs or
molecules, the data often fit in main memory. Thus the present framework is
directly applicable to (and motivated by) molecular feature mining.

By now, we are able to formulate the conjunctive inductive query optimization
problem that is addressed in this paper:

Given

— a language $\mathcal{L}$ of patterns,

— a conjunctive query $Q = Q_a \wedge Q_m$,

— two cost functions $C_a$ and $C_m$ from $\mathcal{L}$ to $\mathbb{R}$,

Find the set of patterns $Th(Q, \mathcal{D}, \mathcal{L})$, i.e. the solution
set of the query $Q$ in the language $\mathcal{L}$ with respect to the database
$\mathcal{D}$, in such a way that the total cost needed to evaluate patterns is
as small as possible.

One useful property of conjunctive inductive queries is that their solution
space $Th(Q, \mathcal{D}, \mathcal{L})$ is a version space (sometimes also
called a convex space).

Definition 1. Let $\mathcal{L}$ be a pattern language, and
$I \subseteq \mathcal{L}$. $I$ is a version space, if
$\forall \phi, \phi', \psi \in \mathcal{L}$:
$\phi \preceq \psi \preceq \phi'$ and $\phi, \phi' \in I \Rightarrow \psi \in I$.
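
Definition 1 can be checked mechanically on a finite fragment of the string
language. The sketch below is an illustration under that finiteness assumption;
the helper names `substrings` and `is_version_space` are hypothetical.

```python
def substrings(s: str) -> set[str]:
    """All substrings of s, including the empty string and s itself."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_more_general(phi: str, psi: str) -> bool:
    return phi in psi


def is_version_space(candidates: set[str], language: set[str]) -> bool:
    """Convexity check: every pattern lying between two members must be a member."""
    for general in candidates:
        for specific in candidates:
            for psi in language:
                between = (is_more_general(general, psi)
                           and is_more_general(psi, specific))
                if between and psi not in candidates:
                    return False
    return True


language = substrings("abcd")  # finite stand-in for the pattern language
print(is_version_space({"b", "ab", "bc", "abc"}, language))
print(is_version_space({"b", "abc"}, language))
```

The second set is not convex because `ab` and `bc` lie between `b` and `abc`
but are missing from it.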

3 Solution Methods 

In this section, we first introduce partial version space trees (an extension of the 
data structure in [6]) and then show how it can be used for the optimization 
problem. 

3.1 Partial Version Space Trees 

A partial version space tree $T$ is an extension of a suffix trie:

— Each edge is labelled with a symbol from $\Sigma$, and all outgoing edges from
a node have different labels.

— Each node $n \in T$ uniquely represents a string $s(n) \in \Sigma^*$ which is
the concatenation of all labels on the path from the root to $n$. We define
$s(root) = \epsilon$. If it is clear from the context, we will simply write $n$
instead of $s(n)$.

— For each node $n \in T$ there is also a node $n' \in T$ for all suffixes $n'$
of $n$. Furthermore, if $n \neq root$, there is a suffix-link to the longest
proper suffix of $n$, which we denote by $suffix(n)$. We write $suffix^2(n)$ for
$suffix(suffix(n))$ etc. and define $suffix^i(root) = \bot$
$\forall i \in \mathbb{N}$, where $\bot$ is a unique entity.

To obtain a partial version space tree, we augment each node $n \in T$ with:

— There are two different labels $l_m$ and $l_a$, one for the monotonic and one
for the anti-monotonic constraint. Each label may obtain one of the values
$\oplus$ or $\ominus$, indicating that the string $s(n)$ satisfies the
constraint or not. If the truth value of a constraint has not been determined
yet, the corresponding label gets a $?$.

— There is a link to the father of $n$, denoted by $parent(n)$. For example,
with $n = abc$ we have $parent(n) = ab$. As for $suffix(n)$ we write
$parent^2(n)$ for $parent(parent(n))$ etc. and define $parent^i(root) = \bot$
$\forall i \in \mathbb{N}$.

— There is a list of links to all incoming suffix-links to $n$ which we denote
by $isl(n)$. For example, if $n = ab$ and $aab$, $cab$ are the only nodes in $T$
that have $n$ as their suffix, $isl(n) = \{aab, cab\}$.
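
One possible in-memory layout for such a tree is sketched below in Python. The
class and field names (`Node`, `PartialVSTree`, `add`) and the label encoding
`"?"`, `"+"`, `"-"` are assumptions chosen for illustration, not the authors'
implementation; the sketch only shows the links described above and does not
enforce any conditions on the labels.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

UNKNOWN, POS, NEG = "?", "+", "-"  # hypothetical encoding of the three label values


@dataclass(eq=False, repr=False)
class Node:
    string: str
    parent: Optional[Node] = None          # link to the father
    suffix: Optional[Node] = None          # link to the longest proper suffix
    l_m: str = UNKNOWN                     # monotonic label
    l_a: str = UNKNOWN                     # anti-monotonic label
    children: dict[str, Node] = field(default_factory=dict)
    isl: list[Node] = field(default_factory=list)  # incoming suffix-links


class PartialVSTree:
    def __init__(self) -> None:
        self.root = Node("")
        self.nodes = {"": self.root}

    def add(self, s: str) -> Node:
        """Insert s, creating the nodes for its prefixes and suffixes as needed."""
        if s in self.nodes:
            return self.nodes[s]
        parent = self.add(s[:-1])
        suffix = self.add(s[1:])
        node = Node(s, parent=parent, suffix=suffix)
        parent.children[s[-1]] = node
        suffix.isl.append(node)
        self.nodes[s] = node
        return node


tree = PartialVSTree()
tree.add("aab")
tree.add("cab")
n = tree.add("abc")
print(n.parent.string, [x.string for x in tree.nodes["ab"].isl])
```

This reproduces the two examples from the text: the parent of `abc` is `ab`,
and the incoming suffix-links of `ab` come from `aab` and `cab`.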



Towards Optimizing Conjunctive Inductive Queries 629 



The following conditions are imposed on partial version space trees:

(C1) for all leaves $n$ in $T$, either $l_m(n) = \ominus$ or $l_m(n) = ?$, and
all nodes with $l_m = \ominus$ are leaves.

(C2) all expanded nodes $n$ have $l_m(n) = \oplus$.

The first condition is motivated by the monotonicity of our query: if $n$ does
not satisfy $Q_m$, none of its descendants can satisfy the monotonic constraint
either, so they need neither be considered nor expanded. A consequence of these
requirements is that nodes $n$ with $l_m(n) = \oplus$ must always be expanded.
An example of a partial version space tree can be seen in Fig. 1, where the
upper part of a node stands for the monotonic and the lower part for the
anti-monotonic label. The second condition is imposed because the pattern space
is infinite in the downwards direction.




Fig. 1. An example of a partial version space tree. Here and in the rest of this paper,
suffix-links are drawn lighter to distinguish them from the black parent-to-child links.



The algorithm given in the next subsection computes the version space tree
starting from the tree containing only the root and then expanding the tree
until no undetermined ($?$) labels are left to evaluate. Then, the solution to
the original query consists of all nodes $n$ in $T$ that have a $\oplus$ in both
of their labels, i.e. $l_m(n) = l_a(n) = \oplus$. As described in [6], it is
also possible to construct the boundary sets $S$ and $G$ from the partial
version space tree. The differences with the previous definition of a version
space tree given by [6] are that 1) previously each node had only one label that
indicated membership of the solution space, and 2) neither parent nor isl links
were present.



3.2 An Algorithmic Framework 

We are now ready to present an algorithmic framework for computing
$Th(Q_m \wedge Q_a, \mathcal{D}, \mathcal{L}_\Sigma)$. The key idea is that
instead of constructing the version space tree in a top-down, Apriori-like
manner, we allow for more freedom in selecting the pattern $\phi$ and the query
$Q_a(\phi)$ or $Q_m(\phi)$ to be evaluated. By doing so, we hope to decrease the
total cost of obtaining the solution space. As a motivating example, assume our
alphabet is $\Sigma = \{a, b, c\}$ and pattern $\phi = abc$ turns out to satisfy
$Q_m$. Then, by the monotonicity of $Q_m$, we know that all patterns more
general than $\phi$ satisfy $Q_m$ as well, so $\epsilon$, $a$, $b$, $c$, $ab$
and $bc$ need no longer be tested against $Q_m$. Thus, by evaluating $\phi$, we
also obtain the truth values (w.r.t. $Q_m$) of six other patterns, which would
all have been tested using a levelwise strategy. If, on the other hand, $\phi$
does not satisfy $Q_m$, all patterns more specific than $\phi$ cannot satisfy
$Q_m$, so the node representing $\phi$ need not be expanded.

Fig. 2. How monotone labels are propagated in the tree. (a) Node $\phi$ is
tested against $Q_m$. Note that this scheme also applies when condition (C2) is
not enforced. (b) If it is positive, we mark it accordingly ... (c) ... and
propagate this label up to the root until we reach a node that has already been
marked positive. (d) If it is negative, the label is propagated down to the
leaves by recursively following the incoming suffix-links.

This suggests the following approach: whenever a pattern $\phi$ is positive
w.r.t. $Q_m$, we propagate the monotonic $\oplus$-label up to the root by
recursively following $\phi$'s parent- and suffix-links, until we reach a node
that has already been marked positive. Furthermore, $\phi$ will be expanded and
all of its children are labelled appropriately. If $\phi$ does not satisfy
$Q_m$, we stop the expansion of this node and propagate the monotonic $\ominus$
down to the leaves by following the incoming suffix-links of $\phi$. See Fig. 2
for a schematic overview of these operations.

For the anti-monotonic query $Q_a$, we propagate the labels in opposite
directions. That is, a $\oplus$ is propagated down to the leaves (by following
the children- and incoming suffix-links) and a $\ominus$ up to the root. The
corresponding algorithm is shown in Fig. 3. We use a priority queue $P$ to store
those nodes whose truth value has not been fully determined, i.e. all nodes
whose monotonic label is $?$ and all nodes with $l_m = \oplus$ and $l_a = ?$.
The queue not only returns the next pattern $\phi$ to be evaluated, but also a
variable $pred$ that tells us which of the predicates $Q_m$ or $Q_a$ should be
evaluated for $\phi$.

Input: a query $Q = Q_m \wedge Q_a$ and a database $\mathcal{D}$
Output: a version space tree $T$ representing $Th(Q, \mathcal{D}, \mathcal{L})$

VSTree $T \leftarrow \{(\epsilon, ?, ?)\}$  // insert empty string and mark it as unseen
PriorityQueue $P \leftarrow \{\epsilon\}$
while ($|P| > 0$)
    $(\phi, pred) \leftarrow P.next$  // get next pattern and predicate to be evaluated
    if ($pred$ = antimonotone)  // evaluate $Q_a$ by accessing $\mathcal{D}$
        if ($Q_a(\phi, \mathcal{D})$)  // a pattern satisfying $Q_a$
            propagate $\oplus$ down to the leaves and remove determined patterns from $P$
        else  // a pattern not satisfying $Q_a$
            propagate $\ominus$ up to the root and remove determined patterns from $P$
    else if ($pred$ = monotone)  // evaluate $Q_m$ by accessing $\mathcal{D}$
        if ($Q_m(\phi, \mathcal{D})$)  // a pattern satisfying $Q_m$
            propagate $\oplus$ up to the root and remove determined patterns from $P$
            expand $\phi$ in $T$
            for all children $\psi$ of $\phi$  // set children's labels
                if ($l_m(suffix(\psi)) = \ominus$) $l_m(\psi) \leftarrow \ominus$ else $l_m(\psi) \leftarrow ?$
                if ($l_a(\phi) = \oplus$ or $l_a(suffix(\psi)) = \oplus$) $l_a(\psi) \leftarrow \oplus$ else $l_a(\psi) \leftarrow ?$
                insert $\psi$ in $P$ if it is not fully determined
        else  // a pattern not satisfying $Q_m$
            propagate $\ominus$ down to the leaves and remove determined patterns from $P$
return $T$

Fig. 3. An algorithmic framework



Whenever a change of labels results in a fully determined node (both labels
determined, or $l_m = \ominus$ with $l_a = ?$), we remove the node from $P$.
Note that nodes with $l_m = \ominus$ and $l_a = ?$ are also deleted from $P$
although their anti-monotonic part is undetermined; cf. the above discussion.

The choice of priorities for nodes determines the search strategy being used. 
By assigning the highest priorities to the most shallow nodes (i.e. nodes that 
are close to the root), a level wise search is simulated as in [6]. On the other 
hand, by assigning the highest priorities to the deepest nodes, we are close to 
the idea of Dualize & Advance [11,10], since we will go deeper and deeper into 
the tree until we encounter a node that has only negative children. Somewhere 
in the middle between these two extremes lies a completely randomized strategy, 
which assigns random priorities to all nodes. 
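
How the choice of priority function changes the evaluation order can be
illustrated with a small Python sketch based on the standard `heapq` module. It
is deliberately simplified: all nodes are known up front, whereas in the
algorithm above nodes enter the queue as the tree is expanded. The function name
`evaluation_order` and the node set are assumptions for illustration.

```python
import heapq
import random


def evaluation_order(nodes, priority):
    """Order in which a priority queue releases the nodes (smallest key first)."""
    heap = [(priority(n), n) for n in nodes]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


nodes = ["ab", "", "abc", "b", "a"]

# Shallow nodes first: simulates a levelwise search.
levelwise = evaluation_order(nodes, priority=len)

# Deepest nodes first: goes deeper and deeper into the tree.
depth_first = evaluation_order(nodes, priority=lambda n: -len(n))

# Random priorities: the completely randomized strategy.
rng = random.Random(0)
randomized = evaluation_order(nodes, priority=lambda n: rng.random())

print(levelwise, depth_first, randomized)
```

Ties between nodes of equal depth are broken here by string order, which is an
arbitrary choice of this sketch.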

3.3 Towards an Optimal Strategy 

Let us first assign four counters to each node $\phi$ in the partial version
space tree $T$: $z_m(\phi)$, $z_{\neg m}(\phi)$, $z_a(\phi)$ and
$z_{\neg a}(\phi)$. Each of them counts how many labels of nodes in $T$ would be
marked if $\phi$'s label were changed (including $\phi$ itself). For example,
$z_m(\phi)$ counts how many monotone labels would be marked $\oplus$ if $\phi$
turned out to satisfy $Q_m$. In the tree in Fig. 1, we have $z_m(cab) = 3$,
because marking $cab$ with a $\oplus$ would result in marking $ab$ and $b$ as
well, whereas $z_m(ac) = 1$, because no other nodes could be marked in their
monotonic part. Likewise, $z_{\neg m}(\phi)$ counts how many monotone labels
would change to $\ominus$ if $Q_m(\phi)$ turned out to be false. The
$z_a(\phi)$- and $z_{\neg a}(\phi)$-counters form the anti-monotonic
counterpart. We define $z_m(\phi) = z_{\neg m}(\phi) = 0$ if $\phi$'s monotonic
label is $\neq\, ?$, and likewise for $z_a$ and $z_{\neg a}$. If there is no
confusion about which node we talk, we will simply write $z_m$ instead of
$z_m(\phi)$. Assume further that we know the following values:

— $P_m(\phi, \mathcal{D})$, the probability that $\phi$ satisfies the predicate
$Q_m$ in database $\mathcal{D}$,

— $P_a(\phi, \mathcal{D})$, the dual of $P_m$ for the anti-monotonic predicate
$Q_a$,

— $C_m(\phi, \mathcal{D})$, the costs for evaluating the monotonic predicate
$Q_m(\phi, \mathcal{D})$, and

— $C_a(\phi, \mathcal{D})$, the dual of $C_m$ for the anti-monotonic predicate
$Q_a$.

Now

$$P_m(\phi, \mathcal{D}) \cdot z_m(\phi) + (1 - P_m(\phi, \mathcal{D})) \cdot z_{\neg m}(\phi)$$

is the expected value of the number of monotone labels that get marked by
evaluating $Q_m$ for pattern $\phi$. Since the operation of evaluating
$Q_m(\phi, \mathcal{D})$ has costs $C_m(\phi, \mathcal{D})$, the average number
of marked labels per cost unit is

$$\frac{1}{C_m(\phi, \mathcal{D})} \left( P_m(\phi, \mathcal{D}) \cdot z_m(\phi) + (1 - P_m(\phi, \mathcal{D})) \cdot z_{\neg m}(\phi) \right).$$

A similar formula holds for the average number of marked anti-monotone labels,
so the optimal node in the partial version space tree is the one where

$$\max\left\{ \frac{1}{C_m} \left( P_m z_m + (1 - P_m) \cdot z_{\neg m} \right),\ \frac{1}{C_a} \left( P_a z_a + (1 - P_a) \cdot z_{\neg a} \right) \right\} \qquad (1)$$

is maximal.

The question now is how to determine the probabilities and costs. For certain
types of queries and databases it is conceivable that costs grow with increasing
complexity of the patterns. For example, testing the coverage of a string
becomes more expensive as the pattern becomes longer. On the other hand, short
patterns are more likely to satisfy $Q_m$ than long ones (and vice versa for
$Q_a$). Therefore, length could be taken into account when approximating the
above costs and probabilities, but let us for now assume that there is no prior
knowledge about those values, so we simply take $P_m = P_a = \frac{1}{2}$ and
$C_m = C_a = 1$. With uniform costs and probabilities, (1) breaks down to
$\frac{1}{2} \max\{z_m + z_{\neg m},\ z_a + z_{\neg a}\}$.
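
The selection criterion (1) can be written as a short Python function. This is a
sketch only: the function name `score`, its parameter names and the example
counter values are assumptions, and the defaults encode the uniform setting
($P_m = P_a = \frac{1}{2}$, $C_m = C_a = 1$) described above.

```python
def score(z_m, z_not_m, z_a, z_not_a, p_m=0.5, p_a=0.5, c_m=1.0, c_a=1.0):
    """Expected number of marked labels per cost unit, and which predicate attains it."""
    mono = (p_m * z_m + (1 - p_m) * z_not_m) / c_m
    anti = (p_a * z_a + (1 - p_a) * z_not_a) / c_a
    if mono >= anti:
        return mono, "monotone"
    return anti, "antimonotone"


# Hypothetical counter values for two nodes.
print(score(z_m=3, z_not_m=1, z_a=1, z_not_a=1))
print(score(z_m=1, z_not_m=1, z_a=4, z_not_a=2))
```

With the uniform defaults the returned value equals half of the larger of the
two counter sums, in line with the simplified form of (1); the node with the
highest score would be evaluated next, against the predicate that is returned.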



3.4 Calculating the Counter-Values

Next, we show how the four counter-values can be computed. Let us start with the
$z_m(\phi)$-counter. Since a monotone $\oplus$-label for $\phi$ would be
propagated up to the root by following $\phi$'s parent- and suffix-link, we
basically have that $z_m(\phi)$ is the sum of $z_m(parent(\phi))$ and
$z_m(suffix(\phi))$ plus 1 for $\phi$ itself. But, due to the fact that
$parent(suffix(\phi)) = suffix(parent(\phi))$, we have that this $z_m$-value has
been counted twice; we thus need to subtract it once (see Fig. 4):

$$z_m(\phi) = z_m(parent(\phi)) + z_m(suffix(\phi)) - z_m(parent(suffix(\phi))) + 1 \qquad (2)$$

There are some exceptions to this equation: The easiest case is when
$suffix(\phi) = parent(\phi)$, which happens if $\phi = \gamma^n$ for
$\gamma \in \Sigma$, $n \in \mathbb{N}$. We then have
$z_m(\phi) = z_m(parent(\phi)) + 1$ because $parent(\phi)$ is the only immediate
generalization of $\phi$. A slightly more complicated exception is when
$suffix^2(\phi) = parent^2(\phi)$, which happens when
$\phi = \gamma\delta\gamma\delta\gamma\ldots$ for $\gamma, \delta \in \Sigma$
(see Fig. 5). Then $z_m(\phi) = z_m(parent(\phi)) + 2$, because all patterns
that are more general than $\phi$ (apart from $suffix(\phi)$) are already
counted by $z_m(parent(\phi))$. Similar rules for calculating $z_m$ hold for
exceptions of the type $suffix^n(\phi) = parent^n(\phi)$:

Fig. 4. Calculating $z_m$: When summing up $z_m(parent(\phi))$ and
$z_m(suffix(\phi))$, $z_m(parent(suffix(\phi)))$ has been counted twice.

Fig. 5. If $\phi = \gamma_1 \ldots \gamma_n$ ($\gamma_i \in \Sigma$) and
$suffix^2(\phi) = parent^2(\phi)$, we have
$\gamma_1 \ldots \gamma_{n-2} = \gamma_3 \ldots \gamma_n$, i.e.
$\gamma_1 = \gamma_3 = \gamma_5 = \ldots$, $\gamma_2 = \gamma_4 = \ldots$

Lemma 1. The counter-value for $z_m$ is given by

$$z_m(\phi) = \begin{cases} z_m(parent(\phi)) + n & \text{if } suffix^n(\phi) = parent^n(\phi) \\ z_m(parent(\phi)) + z_m(suffix(\phi)) - z_m(parent(suffix(\phi))) + 1 & \text{otherwise} \end{cases}$$

where we take the smallest value of $n$ for which the "exceptional" case
applies.

In practical implementations of this method it is advisable to "cut off" the
search for the exceptional cases at a fixed depth and take the value of the
"otherwise"-case as an approximation to the true value of $z_m$.
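
The inclusion-exclusion step of equation (2) and the two exceptional cases
discussed in the text can be checked on small strings with a brute-force Python
sketch. It assumes the simplest situation, in which no monotonic label has been
determined yet, so that $z_m(\phi)$ equals the number of distinct substrings of
$\phi$ (including the empty string and $\phi$ itself). The function names are
hypothetical, and the sketch only covers the specific strings shown; it is not a
general verification of the lemma.

```python
def distinct_substrings(s: str) -> set[str]:
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def z_m_brute(phi: str) -> int:
    """Labels marked by a positive answer for phi when nothing is marked yet."""
    return len(distinct_substrings(phi))


def z_m_eq2(phi: str) -> int:
    """Equation (2), fed with brute-force values for the neighbouring nodes."""
    parent, suffix, both = phi[:-1], phi[1:], phi[1:-1]
    return z_m_brute(parent) + z_m_brute(suffix) - z_m_brute(both) + 1


# Regular case: equation (2) agrees with the brute-force count.
print(z_m_brute("abc"), z_m_eq2("abc"))

# suffix(phi) = parent(phi): equation (2) overcounts, parent + 1 is correct.
print(z_m_brute("aa"), z_m_eq2("aa"), z_m_brute("a") + 1)

# suffix^2(phi) = parent^2(phi): equation (2) overcounts, parent + 2 is correct.
print(z_m_brute("abab"), z_m_eq2("abab"), z_m_brute("aba") + 2)
```

In the two exceptional strings the overcount arises because a non-empty string
is both a prefix and a suffix of $\phi$, so it is reached via the parent and via
the suffix without being a substring of $parent(suffix(\phi))$.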

Since anti-monotone $\ominus$-labels are propagated in the same direction as
monotone $\oplus$-labels, Lemma 1 holds for $z_{\neg a}$ as well. For the
remaining two counters, we have to "peek" in the other direction, i.e. we have
to consider the children and incoming suffix-links of $\phi$. In a similar
manner as we did for $z_m$, we have to consider the values that have been
counted twice when summing over the $z_a$'s of $\phi$'s children and incoming
suffix-links. These are the children of all incoming suffix-links, because their
suffix-links point to the children of $\phi$.

$$z_a(\phi) = \sum_{\psi \in children(\phi)} z_a(\psi) + \sum_{\psi \in isl(\phi)} z_a(\psi) - \sum_{\psi \in children(isl(\phi))} z_a(\psi) + 1 \qquad (3)$$

Again, we need to consider some special cases where the above formula does not
hold. Due to space limitations, we only sketch some of these cases, see [8] for
more details. The first is when one of $\phi$'s children has $\phi$ as its
suffix, which happens iff $\phi = \gamma^n$ for $\gamma \in \Sigma$,
$n \in \mathbb{N}$, because one of $\phi$'s sons is $\gamma^{n+1}$. In this
case, we just sum once over this node and do not subtract $z_a$ of
$\gamma^{n+1}$'s children, because they were counted only once. The second
exception arises when $\psi$, one of $\phi$'s grandchildren, has one of $\phi$'s
incoming suffix-links as its suffix.



4 Experimental Results

We implemented the algorithm from Fig. 3 with two different queuing-strategies. 
The first, called Random, uses random priorities for the nodes in queue P. The 
second strategy, called CounterMax, works with the counter-values from Sect. 
3.3, where we chose uniform costs and probabilities. We checked for exceptional 
cases up to suffix^, as explained in Sect. 3.4. According to the algorithm and to 
(1), each pattern is tested either against $Q_m$ or $Q_a$, depending on which of
the subqueries yields the maximum. We compared the results to an implementation
of algorithm VST of [6] which constructs the version space tree in two passes
(called Descend and Ascend). The Descend algorithm is a straightforward
adaptation of the Apriori and levelwise algorithm for use with strings and
version space trees. It computes the set of all solutions w.r.t. $Q_m$. Ascend
starts from this result, working bottom-up and starting from the leaves of the
tree. For each leaf, Ascend tests whether it satisfies $Q_a$; if it does, the
parent of the leaf will be tested; if it does not, the pattern is labelled
$\ominus$ and the labels are propagated towards the parents and suffixes. More
details can be found in [6].

We used a nucleotide database to compare the three algorithms, so our alphabet
was $\Sigma = \{a, c, g, t\}$. Working with a large alphabet significantly
increases the number of nodes with a $?$. The first dataset $\mathcal{D}_1$ was
used for a minfrequency query and consisted of the first hundred nucleotide
sequences from the Hepatitis C virus of the NIH genetic sequence database
GenBank [23]. The second dataset $\mathcal{D}_2$ held the first hundred
sequences from the Escherichia coli bacterium and was used for a maxfrequency
query. The average length of the entries in $\mathcal{D}_1$ was about 500, the
maximum 10,000. For $\mathcal{D}_2$ we had the values 2,500 and 30,000,
respectively. We do not pretend that our results have any biological relevance;
we simply used these datasets as a testbed for the different methods. We ran
each algorithm several times for the query
minfreq($\phi$; $min$; $\mathcal{D}_1$) $\wedge$ maxfreq($\phi$; $max$; $\mathcal{D}_2$),
where each of the variables $min$ and $max$ could take one of the values
{2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, ..., 95}. Some of the results are listed in
Fig. 6.

Figures 6(a) and 6(b) show how the number of database accesses grows with
decreasing values for $min$ when the maximum frequency is constant. Although
$max$ has been fixed to 1, similar graphs could be shown for other values of
$max$. Note that in the region where the hard instances lie
($min \in \{2, \ldots, 10\}$), CounterMax performs significantly better than
VST. This is in particular true for the number of evaluations of $Q_a$
(Fig. 6(b)). For the easy instances, our method takes slightly more evaluations
of both predicates. This is to be expected: if the length of the longest pattern
satisfying $Q_m$ is small, it is very unlikely to beat a levelwise method. The
random strategy lies somewhere in the middle between those two other methods.

Figures 6(c) and 6(d) show the performance when $min$ is fixed and $max$
changes. The first thing to note is that in Fig. 6(c) the number of evaluations
of $Q_m$ is constant for VST. This is a simple consequence of how the algorithm
works. Again, CounterMax takes fewer evaluations than VST, and Random is in
between. In Fig. 6(d) we can see that for VST, the number of evaluations of
$Q_a$ levels off when $max$ decreases, whereas the other two methods behave
conversely. The reasons for this are clear: VST treats the anti-monotonic query
in a bottom-up manner by starting at the leaves. When it encounters a negative
pattern w.r.t. $Q_a$, it propagates this label up to the root. This is more
likely to happen at the leaves for small $max$, so in these cases it saves a lot
of evaluations of the anti-monotonic predicate. For methods CounterMax and
Random it is better when positive patterns (w.r.t. $Q_a$) are close to the root,
which happens for large $max$, because then all newly expanded nodes will
"automatically" be marked with a $\oplus$ and need never be tested against
$Q_a$.

Fig. 6. A comparison of the three algorithms. Note that in figures (a) and (b) a
logarithmic scale has been used.



5 Related Work and Conclusions 

Algorithms that try to minimize the total number of predicate evaluations have
been around for several years, most notably Gunopulos et al.'s Dualize &
Advance algorithm [11,10] that computes
$S(\mathrm{minfreq}(\phi, \cdot, \mathcal{D}), \mathcal{L})$ in the domain of
itemsets. This works roughly as follows: first, a set $MS$ of maximal specific
sentences is computed by a randomized depth-first search. Then the negative
border of $MS$ is constructed by calculating a minimum hypergraph transversal of
the complements of all itemsets in $MS$. This process is repeated with the
elements from the hypergraph transversal until no more maximal specific
sentences can be found. The result is then the set of all maximal interesting
itemsets.

Although Gunopulos et al. work with itemsets and only consider monotonic
predicates, there is a clear relation to our approach. Whereas the former method
needs to compute the minimum hypergraph transversals to find the candidates for
new maximal interesting sentences, these can be directly read off the partial
version space tree. In fact, all nodes whose monotonic part is still
undetermined are the only possible patterns of
$S(\mathrm{minfreq}(\phi, \cdot, \mathcal{D}), \mathcal{L})$ that have not been
found so far. These are exactly the nodes that are still waiting in the priority
queue. So by performing a depth-first expansion until all children are negative,
our algorithm's behaviour is close to that of Dualize & Advance. It should be
noted that the two strategies are not entirely equal: if a node $\phi$ has
negative children only, it is not necessarily a member of $S$ because there
could still be more specific patterns that have $\phi$ as their suffix and
satisfy $Q_m$.

One of the fastest algorithms for mining maximal frequent sets is Bayardo's
Max-Miner [3]. This one uses a special set-enumeration technique to find large
frequent itemsets before it considers any of their subsets. Although at first
sight this is completely different from what we do, CounterMax also has a
tendency to test long strings first because they will have higher $z_m$-values.
By assigning higher values to $P_m$ for long patterns this behaviour can even be
enforced.

Finally, the present work is a significant extension of that by [6] in that
we have adapted and extended their version space tree and also shown how it
can be used for optimizing the evaluation of conjunctive queries. The presented
technique has also been shown to be more cost effective and flexible than the
Ascend and Descend algorithms proposed by [6], which employ a traditional
levelwise search that minimizes the number of passes through the data rather
than a more informative cost function.


Abstract. In many data mining projects information from multiple
data sources needs to be integrated, combined or linked in order to allow
more detailed analysis. The aim of such linkages is to merge all records
relating to the same entity, such as a patient or a customer. Most of the time
the linkage process is challenged by the lack of a common unique entity
identifier, and thus becomes non-trivial. Linking today's large data
collections becomes increasingly difficult using traditional linkage techniques.
In this paper we present an innovative data linkage system called Febrl,
which includes a new probabilistic approach for improved data cleaning
and standardisation, innovative indexing methods, a parallelisation
approach which is implemented transparently to the user, and a data set
generator which allows the random creation of records containing names
and addresses. Implemented as open source software, Febrl is an ideal
experimental platform for new linkage algorithms and techniques.

Keywords: Record linkage, data matching, data cleaning and standard- 
isation, parallel processing, data mining preprocessing. 



1 Introduction 

Data linkage can be used to improve data quality and integrity, to allow re-use
of existing data sources for new studies, and to reduce costs and efforts in data
acquisition for research studies. In the health sector, for example, linked data
might contain information which is needed to improve health policies,
information that is traditionally collected with time consuming and expensive
survey methods. Linked data can also help in health surveillance systems to
enrich data that is used for pattern detection in data mining systems.
Businesses routinely deduplicate and link their data sets to compile mailing
lists. Another application of current interest is the use of data linkage in
crime and terror detection.

If a unique entity identifier or key is available in all the data sets to be
linked, then the problem of linking at the entity level becomes trivial: a
simple join operation in SQL or its equivalent in other data management systems
is all that is required. However, in most cases no unique key is shared by all
of the data sets, and more sophisticated linkage techniques need to be applied.
These techniques can be broadly classified into deterministic or rules-based
approaches (in which sets of often very complex rules are used to classify pairs
of records as links, i.e. relating to the same person or entity, or as
non-links), and probabilistic approaches (in which statistical models are used
to classify record pairs). Probabilistic methods can be further divided into
those based on classical probabilistic record linkage theory as developed by
Fellegi & Sunter [6], and newer approaches using maximum entropy, clustering and
other machine learning techniques [2,4,5,10,12,14,19].

Computer-assisted data linkage goes back as far as the 1950s. At that time, 
most linkage projects were based on ad hoc heuristic methods. The basic ideas of 
probabilistic data linkage were introduced by Newcombe & Kennedy [15] in 1962 
while the theoretical foundation was provided by Fellegi & Sunter [6] in 1969. 
The basic idea is to link records by comparing common attributes, which include 
person identifiers (like names, dates of birth, etc.) and demographic information. 
Pairs of records are classified as links if their common attributes predominantly 
agree, or as non-links if they predominantly disagree. If two data sets A and B
are to be linked, record pairs are classified in a product space A × B into M,
the set of true matches, and U, the set of true non-matches. Fellegi & Sunter [6]
considered ratios of probabilities of the form

$$R = \frac{P(\gamma \in \Gamma \mid M)}{P(\gamma \in \Gamma \mid U)}$$

where $\gamma$ is an arbitrary agreement pattern in a comparison space $\Gamma$. For example,
$\Gamma$ might consist of six patterns representing simple agreement or disagreement
on (1) given name, (2) surname, (3) date of birth, (4) street address, (5) suburb
and (6) postcode. Alternatively, some of the $\gamma$ might additionally account for the
relative frequency with which specific values occur. For example, a surname value
“Miller” is normally much more common than a value “Dijkstra”, resulting in a
smaller agreement value. The ratio R or any monotonically increasing function
of it (such as its logarithm) is referred to as a matching weight. A decision rule
is then given by

if $R > t_{upper}$, then designate a record pair as link

if $t_{lower} \le R \le t_{upper}$, then designate a record pair as possible link

if $R < t_{lower}$, then designate a record pair as non-link

The thresholds $t_{lower}$ and $t_{upper}$ are determined by a-priori error bounds on false
links and false non-links. If $\gamma \in \Gamma$ mainly consists of agreements then the ratio
R would be large and thus the record pair would more likely be designated as
a link. On the other hand, for a $\gamma \in \Gamma$ that primarily consists of disagreements
the ratio R would be small. The class of possible links are those record pairs for
which human oversight, also known as clerical review, is needed to decide their
final linkage status (as often no additional information is available, the clerical
review process becomes one of applying human intuition, experience or common
sense to the decision based on available data).
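
The matching weight and the three-way decision rule can be sketched in Python as follows. This is a minimal illustration and not Febrl's implementation: the field names, the per-field agreement probabilities (`M_PROB`, `U_PROB`), the assumption that fields agree independently of each other, and the thresholds passed to `classify` are all hypothetical. The weight is the base-2 logarithm of the ratio R, so the thresholds live on the same logarithmic scale.

```python
import math

# Hypothetical per-field probabilities of agreement:
# m = P(field agrees | true match), u = P(field agrees | true non-match).
M_PROB = {"given_name": 0.95, "surname": 0.95, "postcode": 0.90}
U_PROB = {"given_name": 0.01, "surname": 0.005, "postcode": 0.02}


def matching_weight(agreement):
    """Return log2(R) for an agreement pattern, assuming independent fields.

    `agreement` maps a field name to True (agree) or False (disagree).
    """
    weight = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROB[field], U_PROB[field]
        if agrees:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1.0 - m) / (1.0 - u))
    return weight


def classify(weight, t_lower, t_upper):
    """Apply the three-way decision rule to a matching weight."""
    if weight > t_upper:
        return "link"
    if weight < t_lower:
        return "non-link"
    return "possible link"
```

With these illustrative numbers, a pair that agrees on all three fields receives a large positive weight, a pair that disagrees on all three receives a negative weight, and mixed patterns can fall between the thresholds, where clerical review would be required.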

In this paper we present some key aspects of our parallel open source data 
linkage system Febrl (for “Freely extensible biomedical record linkage”), which 
is implemented in the object-oriented language Python¹ (which is open source
itself) and freely available from the project web page. Due to the availability 
of its source code, Febrl is an ideal platform for the rapid development and 
implementation of new and improved data linkage algorithms and techniques. 

2 Related Work 

The processes of data cleaning, standardisation and data linkage have various 
names in different user communities. While statisticians and epidemiologists 
speak of record or data linkage [6], the same process is often referred to as data 
or field matching, data scrubbing, data cleaning, preprocessing, or as the object 
identity problem [7,11,18] by computer scientists and in the database community, 
whereas it is sometimes called merge/purge processing [9], data integration [4], 
list washing or ETL (extraction, transformation and loading) in commercial pro- 
cessing of customer databases or business mailing lists. Historically, the statisti- 
cal and the computer science community have developed their own techniques, 
and until recently few cross-references could be found. 

Improvements [19] upon the classical Fellegi & Sunter [6] approach include
the application of the expectation-maximisation (EM) algorithm for improved
parameter estimation [20], and the use of approximate string comparisons [16]
to calculate partial agreements when attribute values have typographical errors.
Fuzzy techniques and methods from information retrieval have recently
been used to address the data linkage problem [2]. One approach is to represent
records as document vectors and to compute the cosine distance [4] between
such vectors. Another possibility is to use an SQL-like language [7] that allows
approximate joins and cluster building of similar records, as well as decision
functions that decide if two records represent the same entity. Other methods [11]
include statistical outlier identification, pattern matching, clustering and
association-rule-based approaches.

In recent years, researchers have also started to explore the use of machine 
learning and data mining techniques to improve the linkage process. The authors 
of [5] describe a hybrid system that in a first step uses unsupervised clustering 
on a small sample data set to create data that can be used in the second step 
to classify record pairs into links or non-links. Learning field-specific string-edit
distance weights [14] and using a binary classifier based on support vector
machines (SVM) is another approach. A system capable of linking very
large data sets with hundreds of millions of records, using special sorting and
preprocessing techniques, is presented in [21].

¹ http://www.python.org








[Figure 1. Input strings shown: “Doc Peter Miller”, “42 Main Rd. App. 3a” and “Canberra A.C.T. 2600”. Output field labels: Title, Givenname, Surname; Date of Birth (Day, Month, Year); Wayfare no., Wayfare name, Wayfare type; Unit type, Unit no.; Locality name, Territory, Postcode (address fields grouped under Geocode and Locality). Example output values: doctor, peter, miller, 42, main, road, apartment, 3a, canberra, 2600.]



Fig. 1. Example name and address standardisation. 



3 Probabilistic Data Cleaning and Standardisation 

As most real world data collections contain noisy, incomplete and incorrectly 
formatted information, data cleaning and standardisation are important prepro- 
cessing steps for successful data linkage, and before such data can be loaded into 
data warehouses or used for further analysis [18]. Data may be recorded or cap- 
tured in various, possibly obsolete, formats and data items may be missing, out 
of date, or contain errors. The cleaning and standardisation of names and ad- 
dresses is especially important for data linkage, to make sure that no misleading 
or redundant information is introduced (e.g. duplicate records). 

The main task of data cleaning and standardisation is the conversion of 
the raw input data into well defined, consistent forms and the resolution of 
inconsistencies in the way information is represented or encoded. The example 
record shown in Figure 1 consisting of three input components is cleaned and 
standardised into 14 output fields (the dark coloured boxes). Comparing these 
output fields individually with the corresponding output fields of other records 
results in a much better linkage quality than just comparing the whole name or 
the whole address as a string with the name or address of other records. 

Rule-based data cleaning and standardisation as currently done by many
commercial systems is cumbersome to set up and maintain, and often needs
adjustments for new data sets. We have recently developed (and implemented
within Febrl) new probabilistic techniques [3] based on hidden Markov models
(HMMs) [17], which were shown to achieve better standardisation accuracy and to be
easier to set up and maintain than popular commercial linkage software.

Our approach is based on the following three steps. First, the input strings are
cleaned. This involves converting input into lower-case characters, and replacing
certain words and abbreviations with others (these replacements are listed in
look-up tables that can be edited by the user). In the second step, the input
strings are split into a list of words, numbers and characters, which are then
tagged using look-up tables (mainly for titles, given- and surnames, street names
and types, suburbs, postcodes, states, etc.) and some hard-coded rules (e.g. for
numbers, hyphens or commas). Thirdly, these tagged lists are segmented into
output fields using an HMM, i.e. by using the Viterbi algorithm [17] the most
likely path through the HMM gives the corresponding output fields (the states
of the HMM).

Fig. 2. Simple example hidden Markov model for names.
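
The three steps can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: the look-up tables (`CORRECTIONS`, `TAGS`), the tag names, the three-state model and all probabilities are hypothetical and are not the tables or trained HMMs shipped with Febrl.

```python
# Steps 1 and 2: hypothetical correction and tagging look-up tables.
CORRECTIONS = {"doc": "doctor", "dr": "doctor"}
TAGS = {"doctor": "TI", "peter": "GN", "paul": "GN", "miller": "SN"}

# Step 3: a hypothetical three-state HMM over the tags (UN = unknown word).
STATES = ["title", "givenname", "surname"]
START = {"title": 0.3, "givenname": 0.6, "surname": 0.1}
TRANS = {
    "title": {"title": 0.0, "givenname": 0.8, "surname": 0.2},
    "givenname": {"title": 0.0, "givenname": 0.2, "surname": 0.8},
    "surname": {"title": 0.0, "givenname": 0.1, "surname": 0.9},
}
EMIT = {
    "title": {"TI": 0.9, "GN": 0.03, "SN": 0.03, "UN": 0.04},
    "givenname": {"TI": 0.01, "GN": 0.8, "SN": 0.1, "UN": 0.09},
    "surname": {"TI": 0.01, "GN": 0.1, "SN": 0.8, "UN": 0.09},
}


def clean_and_tag(raw):
    """Lower-case, replace abbreviations, split into words and tag them."""
    words = [CORRECTIONS.get(w, w) for w in raw.lower().replace(".", " ").split()]
    return words, [TAGS.get(w, "UN") for w in words]


def viterbi(tags):
    """Return the most likely state sequence for a non-empty tag list."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (START[s] * EMIT[s][tags[0]], [s]) for s in STATES}
    for tag in tags[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * EMIT[s][tag], path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


def standardise(raw):
    """Segment a raw name string into (output field, value) pairs."""
    words, tags = clean_and_tag(raw)
    return list(zip(viterbi(tags), words))
```

For realistic input lengths the products of probabilities would normally be replaced by sums of logarithms to avoid numerical underflow; plain products are used here only to keep the sketch short.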

Details about how to efficiently train the HMMs for name and address stan- 
dardisation, and experiments with real-world data are given in [3]. Training of 
the HMMs is quick and does not require any specialised skills. For addresses, 
our HMM approach produced equal or better standardisation accuracies than 
a widely-used rule-based system. However, accuracies were slightly worse when 
used with simpler name data [3]. 

4 Blocking, Indexing, and Classification 

Data linkage considers the distribution of record pairs in the product space
A × B and determines which of these pairs are links. The number of possible
pairs equals the product of the sizes of the two data sets A and B. The
straightforward approach would consider all pairs and model their distribution.
As the performance bottleneck in a data linkage system is usually the expensive
evaluation of a similarity measure between pairs of records [1], this approach
is computationally infeasible for large data sets; it does not scale. Linking
two data sets each with 100,000 records would result in ten billion potential
links (and thus comparisons). On the other hand, the maximum number of links
that are possible corresponds to the number of records in the smaller data set
(assuming a record can be linked to only one other record). Thus, the space
of potential links becomes sparser with increasing number of records, while the
computational effort increases quadratically.

To reduce the huge amount of possible record comparisons, traditional data 
linkage techniques [6,19] work in a blocking fashion, i.e. they use one or more 
record attributes to split the data sets into blocks. Only records having the same 
value in such a blocking variable are then compared (as they will be in the same 
block). This technique becomes problematic if a value in a blocking variable is 
recorded wrongly, as the corresponding record is inserted into a different block. 
To overcome this problem, several iterations (passes) with different blocking 
variables are normally performed. 
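
Standard blocking with several passes can be sketched as follows for the deduplication of a single data set. The record layout, the example records in `RECORDS` and the two blocking keys are hypothetical; the sketch only shows how candidate pairs are restricted to records sharing a blocking value, and how a second pass recovers a pair that a misspelt value hides from the first.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical example records; record 3 has a misspelt surname.
RECORDS = {
    1: {"surname": "miller", "postcode": "2600"},
    2: {"surname": "miller", "postcode": "2600"},
    3: {"surname": "millar", "postcode": "2600"},
    4: {"surname": "smith", "postcode": "2000"},
}


def blocked_pairs(records, blocking_key):
    """Yield pairs of record identifiers that share a blocking key value."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def multi_pass_pairs(records, blocking_keys):
    """Union of the candidate pairs produced by several blocking passes."""
    pairs = set()
    for key in blocking_keys:
        pairs.update(blocked_pairs(records, key))
    return pairs


def by_surname(rec):
    return rec["surname"]


def by_postcode(rec):
    return rec["postcode"]
```

Blocking on surname alone yields only the pair (1, 2) and misses record 3; adding a postcode pass brings in (1, 3) and (2, 3), while still avoiding half of the six pairs a full comparison would need.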

While such blocking (or indexing) techniques should reduce the number of
comparisons made as much as possible by eliminating comparisons between
records that obviously are not links, it is important that no potential link is
overlooked because of the indexing process. Thus there is a trade-off between
(a) the reduction in number of record pair comparisons and (b) the number of
missed true matches (accuracy).

Febrl currently contains three different indexing methods, with more to be 
included in the future. In fact, the exploration of improved indexing methods is 
one of our major research areas [1]. The first indexing method is the standard 
blocking method [6,19] applied in traditional data linkage systems. The second 
indexing method is based on the sorted neighbourhood [10] approach, where 
records are sorted alphabetically according to the values of the blocking variable, 
then a sliding window is moved over the sorted records, and record pairs are 
formed using all records within the window. The third method uses n-grams (sub- 
strings of length n) and allows for fuzzy blocking (in the current implementation 
we use bigrams, i.e. n = 2). The values of the blocking variable are converted 
into lists of bigrams, and permutations of sub-lists are built using a threshold 
(a value between 0.0 and 1.0) of all possible permutations. The resulting bigram 
sub-lists are converted back into strings and used as keys in an inverted index, 
which is then used to retrieve the records in a block. 
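
The sorted neighbourhood and bigram indexing methods can be sketched as follows. Several details are assumptions made for this illustration and are not taken from the Febrl source: the bigram list is sorted, the sub-list length is the number of bigrams multiplied by the threshold and rounded down (but at least one), and the sub-lists are built as order-preserving combinations rather than permutations.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(values, window):
    """Pair each record with the others inside a sliding window.

    `values` maps a record identifier to its blocking-variable value; the
    window slides over the identifiers sorted by that value.
    """
    ordered = sorted(values, key=lambda rec_id: values[rec_id])
    pairs = set()
    for start in range(len(ordered)):
        for other in ordered[start + 1:start + window]:
            pairs.add(tuple(sorted((ordered[start], other))))
    return pairs


def bigram_keys(value, threshold):
    """Index keys for one value: concatenated sub-lists of its bigram list."""
    bigrams = sorted(value[i:i + 2] for i in range(len(value) - 1))
    if not bigrams:
        return {value}
    length = max(1, int(len(bigrams) * threshold))
    return {"".join(sub) for sub in combinations(bigrams, length)}


def bigram_index(values, threshold):
    """Inverted index mapping each bigram key to the set of record identifiers."""
    index = {}
    for rec_id, value in values.items():
        for key in bigram_keys(value, threshold):
            index.setdefault(key, set()).add(rec_id)
    return index
```

With a threshold of 1.0 each value produces a single key, so the method behaves like exact blocking; lower thresholds produce more keys per value and therefore larger, overlapping blocks that tolerate typographical variations.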

Initial experiments [1] showed that innovative indexing methods can improve 
upon the traditional blocking used in data linkage, but further research needs 
to be conducted. Other research groups have also investigated the use of 
n-grams [2] as well as high-dimensional approximate distance metrics to form 
overlapping clusters [12]. 

For each record pair in the index a vector containing matching weights is calculated
using field comparison functions (for strings, numbers, dates and times).
This vector is then used to classify the pair as either a link, non-link, or possible
link (in which case the decision should be made by human review). While the
classical Fellegi & Sunter [6] approach simply sums all the weights in the vector into one
matching weight, alternative classifiers are possible which improve upon this.
For example, separate weights can be calculated for names and addresses, or
machine learning techniques can be used for the classification task. Classifiers
currently implemented in Febrl are the classical Fellegi & Sunter [6] classifier
described earlier, and a flexible classifier that allows the calculation of the matching
weights using various functions.



5 Parallelisation 



Although computing power has increased tremendously in the last few decades,
large-scale data cleaning, standardisation and data linkage are still
resource-intensive processes. There have been relatively few advances over the last decade
in the way in which probabilistic data linkage is undertaken, particularly with
respect to the tedious clerical review process which is still needed to make
decisions about pairs of records whose linkage status is doubtful. In order to be able
to link large data sets, parallel processing becomes essential. Issues that have to
be addressed are efficient data distribution, fault tolerance, dynamic load
balancing, portability and scalability (both with the data size and the number of
processors used).

[Figure 3, two panels: Step 1 - Loading and indexing; Step 2 - Record pair comparison and classification]

Fig. 3. Speedups for parallel deduplication (internal linkage).

Confidentiality and privacy have to be considered as data linkage deals
with partially identified data, and access restrictions are required. The use of
high-performance computing centers (which traditionally are multi-user
environments) therefore becomes problematic. An attractive alternative is networked
personal computers or workstations, which are available in large numbers in many
businesses and organisations. Such office-based clusters can be used as virtual
parallel computing platforms to run large-scale linkage tasks overnight or on
weekends.

Parallelism within Febrl is currently in its initial stages. Based on the
well-known Message Passing Interface (MPI) [13] standard, and the Python module
Pypar², which provides bindings to an important subset of the MPI routines,
parallelism is implemented transparently to the user of Febrl.

To give an idea of the parallel performance of Febrl, some initial timing results
of experiments made on a parallel computing platform (a SUN Enterprise
450 shared memory (SMP) server with four 480 MHz Ultra-SPARC II processors
and 4 gigabytes of main memory) are presented in this section. Three
internal linkage (deduplication) processes were performed with 20,000, 100,000 
and 200,000 records, respectively, from a health data set containing midwife 
data records provided by the NSW Department of Health. Six field comparison 
functions were used and the classical blocking index technique with three indexes 
(passes) was applied. The standard Fellegi & Sunter [6] classifier was used to 
classify record pairs. 

² http://datamining.anu.edu.au/pypar/

Fig. 4. Randomly generated example data set.

These deduplication processes were run using 1, 2, 3 and 4 processors, respectively.
In Figure 3 speedup (which is defined as the time on one processor
divided by the time on 2, 3, or 4 processors, respectively) results are shown
scaled by the number of records (i.e. the total elapsed run times divided by the
number of records were used to calculate the speedups). The results show that
the record pair comparison and classification step (step 2, which takes between
94% and 98% of the total run times) is scalable, while loading and building of 
the blocking indexes (step 1) results in lower speedup values and is not scalable. 
As most time is spent in step 2, the overall parallel performance is quite scalable. 
Communication times can be neglected as they were less than 0.35% of the total 
run times in all experiments. 
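
The speedup definition, and the reason why a non-scalable step 1 matters little, can be illustrated with a short calculation. The bound used below is Amdahl's law, which is brought in here as an illustration and is not part of the reported experiments; it pessimistically assumes that step 1 does not speed up at all and that step 2 scales perfectly.

```python
def speedup(time_one_processor, time_p_processors):
    """Speedup as defined in the text: T(1) / T(p)."""
    return time_one_processor / time_p_processors


def amdahl_bound(parallel_fraction, processors):
    """Upper bound on the overall speedup when only `parallel_fraction` of
    the run time scales perfectly and the remainder does not scale at all."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With step 2 taking between 94% and 98% of the run time, this bound gives an overall speedup of roughly 3.4 to 3.8 on four processors, which is consistent with the observation that the overall performance is dominated by the scalable step.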



6 Data Set Generation 

As data linkage is dealing with data sets that contain partially identified data, 
like names and addresses, it can be very difficult to acquire data for testing 
and evaluating newly developed linkage algorithms and techniques. For the user 
it can be difficult to learn how to apply, evaluate and customise data linkage 
algorithms effectively without example data sets where the linkage status of 
record pairs is known. 

To overcome this we have developed a database generator based on ideas by 
Hernandez & Stolfo [9]. This generator can create data sets that contain names 
(based on frequency look-up tables for given- and surnames), addresses (based 
on frequency look-up tables for suburbs, postcodes, street numbers, types and 
names, and state/territory names), dates (like dates of birth), and randomly 
created identifier numbers (for example for social security numbers). 

To generate a data set, a user needs to provide the number of original and
duplicate records to be created, the maximal number of duplicates for one original
record, a probability distribution of how duplicates are created (possible are
uniform, Poisson and Zipf), and the probabilities for introducing various random
modifications to create the duplicate records. These modifications include
inserting, deleting, transposing and substituting characters; swapping a field value
(with another value from the same look-up table); inserting or deleting spaces;
setting a field value to missing; or swapping the values of two fields (e.g. surname
with given name). Each created record is given a unique identifier, which
allows the evaluation of accuracy and error rates for data linkage procedures
(falsely linked record pairs and unlinked true matches).
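
The creation of duplicates from an original record can be sketched as follows. Only a subset of the modifications listed above is covered (character-level edits and setting a value to missing), and the function names, the identifier scheme (`<original id>-dup-<number>`) and the way the probabilities are applied per field are hypothetical choices for this illustration, not the behaviour of the Febrl generator.

```python
import random


def modify(value, rng):
    """Apply one random character-level modification to a string."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    operation = rng.choice(["insert", "delete", "transpose", "substitute"])
    letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
    if operation == "insert":
        return value[:pos] + letter + value[pos:]
    if operation == "delete":
        return value[:pos] + value[pos + 1:]
    if operation == "transpose":
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    return value[:pos] + letter + value[pos + 1:]


def make_duplicates(original, num_duplicates, modify_prob, missing_prob, seed=0):
    """Create modified copies of `original`, each with its own identifier."""
    rng = random.Random(seed)
    duplicates = []
    for dup_no in range(num_duplicates):
        record = dict(original)
        record["rec_id"] = "%s-dup-%d" % (original["rec_id"], dup_no)
        for field in record:
            if field == "rec_id":
                continue
            roll = rng.random()
            if roll < missing_prob:
                record[field] = ""  # field value set to missing
            elif roll < missing_prob + modify_prob:
                record[field] = modify(record[field], rng)
        duplicates.append(record)
    return duplicates
```

Because every duplicate carries the identifier of its original, the true match status of any record pair can be read off the identifiers, which is what makes such generated data useful for measuring linkage quality.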

Figure 4 shows a small example data set containing 4 originals and 6 duplicate
records, randomly created using Australian address frequency look-up tables.

7 Conclusions and Future Work

Written in an object-oriented open source scripting language, the Febrl data
linkage system is an ideal experimental platform for researchers to develop,
implement and evaluate new data linkage algorithms and techniques. While the
current system can be used to perform smaller data cleaning, standardisation
and linkage tasks, further work needs to be done to allow the efficient linkage of
large data sets. We plan to improve Febrl in several areas.

For data cleaning and standardisation, we will be improving upon our recently
developed probabilistic techniques [3] based on hidden Markov models
(HMMs), by using the Baum-Welch forward-backward algorithm [17] to
re-estimate the probabilities in the HMMs, and we will explore techniques that can
be used for developing HMMs without explicitly specifying the hidden states.

We aim to further explore alternative techniques for indexing based on high- 
dimensional clustering [12], inverted indexes, or fuzzy n-gram indexes [1,2], in 
terms of their applicability for indexing as well as their scalability both in data 
size and parallelism. 

Current methods for record pair classification based on the traditional Fellegi 
& Sunter [6] approach apply the semi-parametric mixture models and the EM al- 
gorithm [20] for the estimation of the underlying densities and the clustering and 
classification of links and non-links [19]. We will investigate the performance of 
data mining or non-parametric techniques including tree-based classifiers, sparse 
grids [8] and Bayesian Nets. An advantage of these methods is that they allow 
adaptive modelling of correlated data and can deal with complex data and the 
curse of dimensionality. For all these methods current fitting algorithms will be 
adapted and new ones developed. 

We will continue to improve upon the parallel processing functionalities of 
Febrl with an emphasis on running large linkage processes on clusters of personal 
computers (PCs) and workstations as available in many businesses and organi- 
sations. Such office based PC clusters (with some additional software installed) 
can be used as virtual parallel computing platforms to run large-scale linkage
tasks overnight or on weekends. Confidentiality and privacy aspects will need to
be considered as well, as data linkage in many cases deals with identified data.

Abstract. The error-correcting output code (ECOC) approach can be used to reduce
a multiclass categorization problem to multiple binary problems and to improve the
generalization of classifiers. Yet there is no single coding method that can generate
ECOCs suitable for any number of classes. This paper provides a search-coding method
that associates nonnegative integers with binary strings. Given any number of classes
and an expected minimum Hamming distance, the method can find a satisfactory
output code by searching an integer range. Experimental results show
that, as a general coding method, the search-coding method can improve the
generalization of both stable and unstable classifiers efficiently.



1 Introduction 

One-per-class [1], all-pairs [2], meaning-distributed code [3] and error-correcting
output code (ECOC) [4] are four approaches to reduce a multiclass problem to multiple
binary classification problems [5]. The advantage of the error-correcting output code
strategy over the other approaches is that ECOC can recover from erroneous results
of several binary functions, which improves the predictive accuracy of the supervised
classifiers. Some research shows that ECOCs can reduce both variance and bias
errors for multiclass problems [6]. But there are no general methods to construct
effective error-correcting output codes. Dietterich and Bakiri [5] introduced four coding
methods, all of which have disadvantages: exhaustive codes are unsuitable for
classification tasks with many classes, since they increase training time greatly;
both column selection and randomized hill climbing are non-deterministic methods; BCH
codes can only be adapted to problems where the number of classes is a power of two.
Finding a single method suitable for any number of classes is an open research
problem.

This paper explores a search-coding method that associates nonnegative integers
with binary strings. Given any number of classes and an expected minimum Hamming
distance, the method can find a satisfactory output code by searching an integer
range. The search-coding method can be used as a general coding method to construct
error-correcting output codes.

2 Search-Coding Method for Supervised Learning

2.1 Search-Coding Method 



Our search-coding method obtains error-correcting output codes by searching an
integer range. Before the search, a table named CodeTable must be created.
Each item of the table, item(d,n), stores the maximum number of codewords that
satisfy the code length n ($n \ge 1$) and the minimum Hamming distance d ($d \ge 1$).
CodeTable can be saved as permanent information after being created.

Table 1. CodeTable with 3 ≤ d ≤ 9 and 3 ≤ n ≤ 16

d \ n     3     4     5     6     7     8     9    10    11    12    13    14    15    16
3         2     2     4     8    16    16    32    64   128   256   512  1024  2048  2048
4         0     2     2     4     8    16    16    32    64   128   256   512  1024  2048
5         0     0     2     2     2     4     4     8    16    16    32    64   128   256
6         0     0     0     2     2     2     4     4     8    16    16    32    64   128
7         0     0     0     0     2     2     2     2     4     4     8    16    32    32
8         0     0     0     0     0     2     2     2     2     4     4     8    16    32
9         0     0     0     0     0     0     2     2     2     2     2     4     4     4



Fig. 1(a) depicts the pseudo code for creating one item of CodeTable. In function
CreateTableItem(d,n), the integer 0 is the only element in the initialized set A.
Beginning from 1, we search the whole integer range $[1, 2^n - 1]$ in sequence and
add more integers to A. If the Hamming distance between the binary
string of an integer x and that of every integer in set A is equal to or larger than d, then
x is added to A, and the next integer is tested in turn. Function DiffBit(G,H) compares
two binary strings and returns the number of differing bits. Function Bin(x,n) converts
integer x to an n-bit binary string whose i-th bit ($1 \le i \le n$) is
$\lfloor x / 2^{i-1} \rfloor \bmod 2$. After searching through the range $[1, 2^n - 1]$,
the number of elements in set A is the value of item(d,n). Table 1 lists each item of
CodeTable where $3 \le d \le 9$ and $3 \le n \le 16$.

During the search, the code length must first be decided according to the
saved CodeTable, the number of classes m, and the expected minimum Hamming
distance d. The length will be n if two neighbouring table items item(d,n−1) and
item(d,n) satisfy $item(d,n-1) < m \le item(d,n)$. Then we construct an output code
with m codewords whose code length is n and whose minimum Hamming distance is d. Fig.
1(b) gives the pseudo code for searching an output code. Given any values of d and m,
function SearchCode(d,m) will return a satisfactory output code. In SearchCode(d,m),
function FindCodeLen(CodeTable,d,m) is used to decide the length of the output
code. Then, in a similar way as in CreateTableItem(d,n), we find m integers recorded in
set A. Finally, each element of A is translated into an n-bit binary string by Bin(x,n).
All m binary strings are saved in the code matrix B, and B is the resulting output code.
Of course, if the value of d satisfies $d \ge 3$, the resulting B is an error-correcting
output code.



(a) CreateTableItem(d, n)

    1. If n < d Then Return 0;
    2. Initialization: A = {0};
    3. For Each Integer x in [1, 2^n - 1]
       3.1 Tag = True;
       3.2 For Each Integer y in A
             If DiffBit(Bin(x,n), Bin(y,n)) < d Then Tag = False;
       3.3 If Tag = True Then A = {x} ∪ A;
    4. Return |A|.

(b) SearchCode(d, m)

    1. Initialization: i = 0, A = {0}, x = 1;
    2. n = FindCodeLen(CodeTable, d, m);
    3. While |A| < m && x < 2^n Do
       3.1 Tag = True;
       3.2 For Each Integer y in A
             If DiffBit(Bin(x,n), Bin(y,n)) < d Then Tag = False;
       3.3 If Tag = True Then A = {x} ∪ A;
       3.4 x = x + 1;
    4. For Each Element y in A
         B[i] = Bin(y,n), i = i + 1;
    5. Return B.

Fig. 1. Pseudo code of the search-coding method: (a) pseudo code of creating one item for
CodeTable; (b) pseudo code of searching an output code
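
The two procedures of Fig. 1 translate almost directly into Python. The sketch below is an illustrative transcription, not the authors' code: the function names are adapted to Python conventions, the shared greedy scan is factored into a helper (`greedy_codewords`), and `find_code_len` recomputes table items on the fly up to a hypothetical `max_len` instead of reading a stored CodeTable. Codewords are written most significant bit first, which does not affect any Hamming distance.

```python
def diff_bit(x, y):
    """Hamming distance between the binary representations of x and y."""
    return bin(x ^ y).count("1")


def greedy_codewords(d, n, limit=None):
    """Scan 0, 1, ..., 2**n - 1 in order and keep every integer whose Hamming
    distance to all integers kept so far is at least d. Stop early once
    `limit` integers have been kept (if a limit is given)."""
    kept = [0]
    for x in range(1, 2 ** n):
        if limit is not None and len(kept) >= limit:
            break
        if all(diff_bit(x, y) >= d for y in kept):
            kept.append(x)
    return kept


def create_table_item(d, n):
    """Entry item(d, n) of CodeTable, as in Fig. 1(a)."""
    if n < d:
        return 0
    return len(greedy_codewords(d, n))


def find_code_len(d, m, max_len=16):
    """Smallest n with item(d, n - 1) < m <= item(d, n)."""
    for n in range(1, max_len + 1):
        if create_table_item(d, n) >= m:
            return n
    raise ValueError("no code length up to max_len is large enough")


def search_code(d, m):
    """Return m codewords (bit strings) with minimum Hamming distance d,
    as in Fig. 1(b)."""
    n = find_code_len(d, m)
    return [format(x, "0%db" % n) for x in greedy_codewords(d, n, limit=m)]
```

For example, `create_table_item(3, 7)` reproduces the entry 16 of Table 1, and `search_code(5, 3)` returns three codewords of length 8, the length that Table 1 prescribes for d = 5 and m = 3.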

2.2 Properties of the Output Codes 

We can prove that an output code generated by the search-coding method has the
following properties:

1. The output code has no all-zeros columns; 

2. The output code has no all-ones columns; 

3. The output code has no complementary columns; 

4. If there are identical columns in the output code, they must be consecutive; 

5. If there are k identical columns in the output code, then k< d . 

The advantages of our search-coding method are: (1) It can generate output codes
for a classification problem with any number of classes. (2) Given an expected
minimum Hamming distance, the resulting output code is short. This
saves training time, since the number of binary functions to be learned is relatively
small. (3) The method is deterministic. Therefore our search-coding method
can be used as a general method to construct error-correcting output codes.



3 Experimental Results 

Naive Bayesian and Back Propagation Neural Network (BPNN) classifiers are two
representative algorithms in the supervised learning field. The naive Bayesian classifier is
relatively stable with respect to small changes in the training data, while the BPNN
classifier is not. We apply our search-coding method to these two algorithms to evaluate
its effect on the predictive results of stable and unstable classifiers. In the experiments,
both Hamming distance (HD) and absolute distance (AD) are used as distance measures. All
experimental data sets come from the UCI machine learning repository [7]. The
expected minimum Hamming distance for the search-coding method is 5. The reported
results are the means and deviations of 10-fold cross validation.
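
The two distance measures can be illustrated with a small decoding sketch. The text does not define them formally, so the following is an assumption based on common ECOC practice: Hamming decoding first thresholds the outputs of the binary functions at 0.5 and counts differing bits, whereas absolute-distance decoding sums the absolute differences between the raw outputs and the codeword bits. The function names and the 0.5 threshold are hypothetical.

```python
def hamming_decode(outputs, code):
    """Index of the codeword that differs in the fewest bits from the
    thresholded outputs of the binary functions."""
    bits = [1 if o >= 0.5 else 0 for o in outputs]
    distances = [sum(b != int(c) for b, c in zip(bits, word)) for word in code]
    return distances.index(min(distances))


def absolute_decode(outputs, code):
    """Index of the codeword with the smallest summed absolute difference
    from the raw (real-valued) outputs of the binary functions."""
    distances = [sum(abs(o - int(c)) for o, c in zip(outputs, word)) for word in code]
    return distances.index(min(distances))
```

The absolute distance keeps the confidence of each binary function, which the thresholding in the Hamming variant discards; this is one possible reason why the two measures can give slightly different error rates.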



Table 2. Error results of different naive Bayesian classifiers

Datasets     NB-normal        NB-D             SCNB (HD)        SCNB (AD)
Austra       19.13 ± 4.52     14.20 ± 4.03     13.91 ± 4.49     14.20 ± 4.36
Bupa         43.53 ± 10.26    40.59 ± 6.01     35.88 ± 5.27     35.58 ± 6.05
Cancer        4.06 ± 1.65      2.90 ± 1.93      2.75 ± 1.93      2.60 ± 1.74
Cleveland    43.67 ± 7.45     42.00 ± 5.02     41.00 ± 3.87     40.33 ± 3.67
Glass        50.48 ± 7.84     29.04 ± 7.26     32.38 ± 7.71     30.47 ± 6.83
Heart        15.56 ± 4.88     16.67 ± 6.59     15.18 ± 6.06     14.07 ± 5.53
Iris          6.47 ± 3.22      6.00 ± 5.83      6.00 ± 5.83      4.00 ± 3.44
Pima         24.34 ± 5.16     25.13 ± 4.49     23.29 ± 3.62     23.03 ± 3.89
Wine          2.35 ± 3.04      2.94 ± 5.00      2.35 ± 4.11      2.94 ± 5.00
Average      23.09            19.94            19.19            18.58



Table 3. Experimental results of error rate and training time for BPNN and SCBP

             BPNN                        SCBP
Data set     error (%)       time (s)    HD error (%)    AD error (%)    time (s)
Austra       15.79 ± 4.29    23.82       14.35 ± 3.64    13.91 ± 4.17     97.67
Bupa         27.06 ± 4.96    19.21       25.59 ± 5.55    26.76 ± 5.08     86.14
Cancer        3.19 ± 1.50    11.19        3.19 ± 1.33     3.19 ± 1.33     23.25
Cleveland    46.67 ± 8.16     9.30       42.67 ± 7.50    43.67 ± 7.10    153.17
Glass        29.52 ± 7.37    11.75       28.57 ± 9.52    29.05 ± 9.90     38.73
Heart        21.11 ± 7.66     8.47       15.19 ± 6.16    15.92 ± 7.21     55.52
Iris          4.67 ± 4.50     0.17        4.00 ± 4.66     3.33 ± 4.71      1.98
Pima         23.15 ± 4.77    22.19       22.50 ± 4.97    23.28 ± 4.80    102.77
Wine          1.76 ± 2.84     0.16        1.76 ± 2.84     1.18 ± 2.48      0.60
Average      19.21           11.81       17.53           17.81            62.20



Table 2 shows the predictive results of different naive Bayesian classifiers. NB-normal
is the classifier that uses the normal distribution as the conditional probability distribution
of each class given any continuous attribute. NB-D uses discretization for each
continuous attribute. SCNB denotes the naive Bayesian classifiers with the search-coding
method. Since deciding the probability distribution of either class for the binary
functions in SCNB is almost impossible, we discretize the value range of each
continuous attribute into several intervals and treat the attribute as a discrete one.
From Table 2, we can see that NB-normal gets the worst results for 5 data sets, and
also for the average results. Both SCNB (HD) and SCNB (AD) get better results than
the other two classifiers for at least 7 data sets, and their average results are also the
better ones. For SCNB classifiers, the absolute distance measure seems better than the
Hamming distance measure.

Table 3 gives the error rate and training time results of BPNN and SCBP,
where SCBP denotes the BPNN classifiers based on the search-coding method.
During the training process, the learning rate is 0.5 and the momentum rate is
0.9. As a convergence criterion, we required a mean square error smaller than
0.02. Table 3 shows that, for the Cancer data set, BPNN and SCBP get the same
error rate; for the other 8 data sets, SCBP gets a lower error rate than BPNN.
SCBP (HD) has a slightly better result than SCBP (AD). Yet for all of the data
sets, SCBP needs much longer training time than BPNN. This indicates that the
error-correcting output codes constructed by the search-coding method generate
more complex binary functions than one-per-class output codes.

All the above results show that the search-coding method can be used to improve
the generalization of both stable and unstable supervised learning classifiers.
Yet between the two distance measures, it is hard to decide which one is better
in our experiments.



4 Conclusion 

To address the disadvantages of the existing coding methods for ECOCs, this paper
proposes a general coding strategy: the search-coding method. Given any number of
classes and an expected minimum Hamming distance, the method can find a
satisfactory output code by searching an integer range. By applying the method to
supervised learning algorithms, experimental results show that this coding method
can improve the generalization of both stable and unstable classifiers.

Acknowledgements. This research was supported by the National Natural Science 
Foundation of China, under Grant No. 69825104. 

Abstract. The problem of partial periodic pattern mining in a discrete
data sequence is to find subsequences that appear periodically and fre- 
quently in the data sequence. Two essential subproblems are the efficient 
mining of frequent patterns and the automatic discovery of periods that 
correspond to these patterns. Previous methods for this problem in event 
sequence databases assume that the periods are given in advance or re- 
quire additional database scans to compute periods that define candidate 
patterns. In this work, we propose a new structure, the abbreviated list 
table (ALT), and several efficient algorithms to compute the periods and 
the patterns, that require only a small number of passes. A performance 
study is presented to demonstrate the effectiveness and efficiency of our 
method. 



1 Introduction 

A discrete data sequence refers to a sequence of discrete values, e.g., events,
symbols and so on. The problem of partial periodic pattern mining on a discrete
data sequence is to find subsequences that appear periodically and frequently
in the data sequence. E.g., in the symbol sequence “abababab”, the subsequence
“ab” is a periodic pattern. Since periodic patterns show trends in time series
or event sequences, the problem of mining partial periodic patterns has been
studied in the context of time series and event sequence databases ([1]-[3]).

Two essential sub-problems are the automatic discovery of periods and the
efficient mining of frequent patterns. Given a period value, an Apriori-like
algorithm is introduced in [3] to mine the frequent patterns. Han et al. in [2]
propose a novel structure, the max-subpattern tree, to facilitate counting of
candidate patterns. This method outperforms the Apriori-like algorithm, but it
assumes that the periods are given in advance, which limits its applicability.
Berberidis et al. [1] have proposed a method that finds periods for which a data
sequence may contain patterns. However, this method may miss some frequent
periods and it requires a separate pass to scan the data sequence to compute the
periods. Moreover, for each symbol, it needs to compute (at a high cost) the
circular autocorrelation value for different periods in order to determine
whether the period is frequent or not.

We observe that a frequent pattern can be approximately expressed by an
arithmetic series together with a support indicator about its frequency. In this
paper, we propose a novel structure, the abbreviated list table (ALT), that
maintains the occurrence counts of all distinct elements (symbols) in the sequence
and facilitates the mining of periods and frequent patterns. We also
present a fast $O(n)$ algorithm to identify periods from the ALT.

The paper is organized as follows. Section 2 defines the mining problem 
formally. Section 3 presents the newly proposed approach. Section 4 includes 
performance evaluation of our methods. Finally, section 5 concludes the paper 
and proposes the future work. 



2 Problem Definition 

Let Domain D be the set of elements that can be symbols, events, discretized
locations, or any categorical object type. A discrete data sequence S is composed
of elements from D and can be expressed as $S = e_0, e_1, \ldots, e_{n-1}$, where
the index $i$ of $e_i$ denotes the relative order of an element and $n$ is the
length of S. Given a period T, a periodic fragment
$S_i = e_{iT}, e_{iT+1}, \ldots, e_{(i+1)T-1}$, $(0 \le i < \lfloor n/T \rfloor)$,
is a subsequence of S, and $\lfloor n/T \rfloor$ is the number of fragments in S
with respect to period T. Element $e_{iT+j}$, $(0 \le j < T)$, in fragment $S_i$
is at the j-th period position. There are T period positions,
$0, 1, \ldots, T-1$, for a given period T.

Given a period T, a T-period pattern P is a sequence of elements
$p_0, p_1, \ldots, p_{T-1}$, where each $p_j$, $(0 \le j < T)$, can be the wild
card $*$ or an element from D. If $p_j = *$, then any element from D can be
matched at the j-th position of P. A periodic fragment
$S_i = e_{iT}, e_{iT+1}, \ldots, e_{(i+1)T-1}$, $(0 \le i < \lfloor n/T \rfloor)$,
of a sequence S matches pattern $P = p_0, p_1, \ldots, p_{T-1}$ if
$\forall j$, $0 \le j < T$, (1) $p_j = *$, or (2) $p_j = e_{iT+j}$.

A pattern is a partial pattern if it contains the element ‘*’. The length
L of a pattern is the number of non-‘*’ elements in the pattern. We will call a
length-L T-period pattern P an L-pattern if the period T is clear in the context.
P' is a subpattern of a pattern P if it is generated from P by replacing some
non-‘*’ elements in P by ‘*’. E.g., ‘a * c’ is a subpattern of the 3-pattern ‘abc’.
Similar to the period position of an element in a fragment, the period position
of an element in pattern P is also the relative position of this element in the
pattern. In the 3-period pattern ‘a * *’, the period position of ‘a’ is 0.

The support of a T-period pattern P, denoted as $sup(P)$, in a sequence S
is the number of periodic fragments that match P. A pattern P is frequent
with respect to a support parameter min_sup, $(0 < min\_sup \le 1)$, iff
$sup(P) \ge \lfloor n/T \rfloor \times min\_sup$ (the support threshold). If there
exists a frequent T-period pattern, we say that T is a frequent period. Element
e is a frequent element if it appears in a frequent pattern.
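
The definitions above can be illustrated with a minimal Python sketch. It assumes that a sequence and a pattern are plain strings, that ‘*’ is the wild card, and that only complete fragments are counted; the function names are illustrative and are not taken from the paper.

```python
def fragments(seq, period):
    """Split seq into its floor(n / period) complete periodic fragments."""
    count = len(seq) // period
    return [seq[i * period:(i + 1) * period] for i in range(count)]


def matches(fragment, pattern):
    """A fragment matches if every non-'*' pattern position agrees with it."""
    return all(p == "*" or p == e for p, e in zip(pattern, fragment))


def support(seq, pattern):
    """Number of periodic fragments of seq that match the T-period pattern."""
    return sum(matches(f, pattern) for f in fragments(seq, len(pattern)))


def is_frequent(seq, pattern, min_sup):
    """Apply the threshold sup(P) >= floor(n / T) * min_sup."""
    threshold = (len(seq) // len(pattern)) * min_sup
    return support(seq, pattern) >= threshold
```

For the sequence “abababab” from the introduction, `support("abababab", "ab")` counts all four fragments of period 2.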

The problem of mining partial periodic patterns can now be defined as follows.
Given a discrete data sequence S, a minimum support min_sup and a
period window W, find:

(1) the set of frequent periods T such that $1 \le T \le W$; and

(2) all frequent T-period patterns w.r.t. min_sup for each T found in (1).

3 Mining Using the Abbreviated List Table

This section describes the method for automatic discovery of frequent periods
using the Abbreviated List Table (ALT), and a method that performs efficient
mining of frequent patterns. In phase one, we scan the input sequence to
construct an ALT that records the frequency of every element. In phase two, we use
the max-subpattern tree [2] to mine the frequent patterns.

3.1 Abbreviated List Table 

If an element appears periodically for a period T, its positions in the sequence
will form an arithmetic series, which can be captured by three parameters:
period position, period and count (frequency). For example, in sequence S =
‘bdcadabacdca’, the occurrences of ‘a’ for period = 2 are {3,5,7,11}. We can
represent them by (1,2,4), where 1 is the period position of the occurrence of
‘a’ with period = 2, and 4 is the frequency of the corresponding pattern ‘*a’.
For a given period T, for every element, we need to keep T representations:
$(0, T, count_0), \ldots, (T-1, T, count_{T-1})$. We call these representations
Abbreviated Lists (AL). The ALs of all the elements with respect to all periods
bounded by a period window can be constructed in a single scan of the sequence.

The algorithm for maintaining the ALT is shown in Fig. 1a. The ALT is a
3-dimensional table: the first dimension is the element’s index in domain D, and
the second and third dimensions are the period and the period position,
respectively.



Algorithm ALT1(ALT, W, S)

1. while (S still has elements) {
2.     read element e from S;    // SeqPos is e's position in S
3.     get the index idx of e in D;
4.     for (p := 1; p ≤ W; p++) {
5.         pos := SeqPos mod p;
6.         ALT[idx][p − 1][pos]++; }}
7. //truncate sequence for all periods

Fig. 1a. ALT Maintenance



Algorithm ALT2(ALT, W, min_sup, n)

1. for (idx := 0; idx < |D|; idx++) {
2.     get element e whose index equals idx;
3.     for (p := 1; p ≤ W; p++) {
4.         threshold := ⌊n/p⌋ × min_sup;
5.         for (pos := 0; pos < p; pos++) {
6.             if (ALT[idx][p − 1][pos] ≥ threshold)
7.                 output p, pos and element e;
8. }}}

Fig. 1b. Finding Periods and $F_1$



We now show an example for this algorithm. Let the data sequence S be
“abaaaccaae” and period window W be 5. For the first ‘a’ at position 0, the
counters at period position 0 of all the periods are incremented by 1. Upon seeing
the second ‘a’ at position 2, for periods 1 and 2, the counters at period position
0 are incremented. For periods 3, 4, and 5, the counters at period position 2 are
incremented. The process continues, until we process all the elements in S and
have the values shown in Table 1.

While maintaining the Abbreviated List Table, we can compute the frequent
periods and $F_1$ at any moment against the length of the data sequence
scanned so far ($F_1$ is the set of size-1 frequent patterns). Fig. 1b shows the
algorithm to compute the periods and $F_1$. We again take the ALT in Table 1 as
an example, assuming min_sup = 0.8. For period = 2, the threshold is
$\lfloor 10/2 \rfloor \times 0.8 = 4$, and ‘a’ at period position 0 is frequent
because its count is 4. However, ‘a’ is not frequent at period position 1 because
its count is only 2. So ‘a*’ is a frequent pattern but ‘*a’ is not, and 2 is a
frequent period. Similarly, we can find all the frequent periods, 2, 4 and 5, and
their related $F_1$ sets.
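
The two procedures of Fig. 1a and Fig. 1b can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ALT is keyed by the element itself rather than by its index in D, the final truncation step of ALT1 is omitted (so occurrences after the last complete fragment are also counted), and the example sequence is assumed to read "abaaaccaae", the reading consistent with the counts quoted in the text.

```python
from collections import defaultdict


def build_alt(seq, window):
    """One scan over seq; alt[e][p - 1][pos] counts e at period position pos."""
    alt = defaultdict(lambda: [[0] * p for p in range(1, window + 1)])
    for seq_pos, element in enumerate(seq):
        for p in range(1, window + 1):
            alt[element][p - 1][seq_pos % p] += 1
    return alt


def frequent_singletons(alt, window, min_sup, n):
    """Return (period, position, element) triples whose count meets the threshold."""
    found = []
    for element, table in alt.items():
        for p in range(1, window + 1):
            threshold = (n // p) * min_sup
            for pos in range(p):
                if table[p - 1][pos] >= threshold:
                    found.append((p, pos, element))
    return found


seq = "abaaaccaae"
alt = build_alt(seq, window=5)
f1 = frequent_singletons(alt, window=5, min_sup=0.8, n=len(seq))
frequent_periods = sorted({p for p, _, _ in f1})
```

Because every counter is addressed directly by element, period and position, each element of the sequence costs W constant-time updates, which is where the single-scan behaviour comes from.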

Our method is more efficient than the circular autocorrelation method in [1]
for the following reasons: (1) We compute $F_1$ during the period discovery
process in one pass. (2) Our method works well even when the length of the data
sequence n is unknown in advance. (3) We can find frequent periods directly
from the ALT.

3.2 Finding Frequent Patterns 

For each frequent period found in step 1, the algorithm constructs the
max-subpattern tree in step 2 using $F_1$ and the inverted lists. The lists of
frequent elements are merged to reconstruct the fragments and they in turn are
inserted into the max-subpattern tree. Finally, we get frequent patterns by
traversing all the trees. We note two reasons for the superiority of the ALT-based
algorithm over [2]. (1) This phase does not need to compute $F_1$. (2) Only the
inverted lists for frequent elements are needed to build the tree. This requires
less I/O compared with the original max-subpattern tree algorithm.

3.3 Analysis 

Space complexity: Assume the period window is W; each element may have
frequent periods from 1 to W. For each period, we need to record the occurrences
at each period position. The number of counters in the ALT of an element
is $O(W^2)$. Given $|D|$ elements, the space required is $O(|D| W^2)$, which is
independent of the sequence length. We expect that W and $|D|$ are in the order
of hundreds in practice, thus the ALT can be accommodated in the main memory.
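
As a rough check of this bound, the following worked count assumes that every period T from 1 to W keeps exactly one counter per period position:

```latex
\[
\#\text{counters per element} = \sum_{T=1}^{W} T = \frac{W(W+1)}{2} = O(W^2),
\qquad
\#\text{counters in total} = |D| \cdot \frac{W(W+1)}{2} = O(|D|\,W^2).
\]
```

For instance, with $W = 100$ and $|D| = 100$ this gives $100 \times 5050 = 505{,}000$ counters, which fits comfortably in main memory.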

Time complexity: We need one scan on the sequence to build the ALT. Since
the locations of ALT entries can be accessed in $O(1)$ time, the time complexity to
create the ALT is in the order of $O(n)$. The construction of the max-subpattern
tree and the computation of the frequent patterns is also $O(n)$.

4 Experiments 

We compare our method with the algorithm proposed in [1]. We will use
“ALT+tree” to denote our method, and use “Circular Autocorrelation+tree”
to represent the method combining [1] and [2]. All the experiments were
performed on a Pentium III 700MHz workstation with 4GB of memory, running
Unix. A synthetic data generator is used to generate periodic object movements,
with four parameters: period T, sequence length n, max pattern length l and
probability p with which the object complies with the pattern.

Given a data sequence with parameters n = 1M, T = 50, p = 0.8 and l = 25,
Fig. 2a shows that our method is much faster than [1] for a fixed minimum
support of 0.6. Moreover, the cost difference rises with the increase
of the window size. Table 2 records the cost breakdown of the two methods for
W = 100. Note that the cost difference is mainly attributed to the excessive
cost of circular autocorrelation in discovering the periods. The finding of $F_1$
also contributes some cost difference, as [1] needs one more scan of the sequence
to get $F_1$. The cost of building the max-subpattern tree in our method is less
than that of scanning the whole sequence, since we only need to access the
inverted lists for frequent elements. Fig. 2b shows that both methods have cost
linear in the size of the database, due to the limited number of database scans;
however, our method is much faster than the previous technique. For this
experiment, we fix parameters T = 100, p = 0.8 and l = 50.

Table 3 lists the frequent periods found and the number of frequent patterns 
mined from the experiment of Fig. 2b. Since the parameter T used to generate 
data sequence is set to 100 for all sequences and the mining parameter for W is 
100, only one frequent period 100 is found. 




Fig. 2a. Efficiency vs. W

Fig. 2b. Scalability

Table 1. ALT for a

Table 2. Cost comparison

Table 3.

n        period   num of freq. pat.
100K     100      12
500K     100      42
1000K    100      76
2000K    100      133



5 Conclusion 

In this paper, we presented a new method to perform partial periodic pattern 
mining on discrete data sequences. Using the proposed ALT structure, we can 
find frequent periods and the set of 1-patterns during the first scan on data se- 
quence. This step is much more efficient compared to a previous approach, which 
is based on circular autocorrelation. Further, frequent patterns are discovered 
by inserting fragments to the max-subpattern tree, using the inverted lists in 
order to avoid accessing irrelevant information. Our experiments show that the 
proposed technique significantly outperforms the previous approaches. 

Abstract. Health databases are characterised by a large number of
records, a large number of attributes and mild density. This encourages
data miners to use methodologies that are more sensitive to health
industry specifics. For conceptual mining, the classic pattern-growth
methods are found limited due to their great resource consumption. As an
alternative, we propose a pattern splitting technique which delivers as
complete and compact knowledge about the data as the pattern-growth
techniques, but is found to be more efficient.



1 Introduction 

In many countries, the health care industry is challenged by the growth of costs
associated with the use of new treatments or diagnostic techniques and by
inefficient health care practices where funds are unnecessarily spent with no
additional benefits to patients. It has become very important for health service
administrators to better understand current health care trends and patterns and
associated costs in order to estimate health costs into the future. The key
characteristics of a health system are hospital care, visits to medical
practitioners, and the consumption of pharmaceuticals, calculated with regard to
particular cohorts of patients. One unit of measurement for such calculations is
the episode of care [8], which has a variety of definitions. Episodes take into
account various indices of patient care, for instance, the patient's age, ethnic
background, gender, location, medical services provided, information about
participating physicians, fees and others. Aggregating these attributes is
important for Medicare (Australia’s universal health scheme) administrators
because they can then produce extensive reports on utilisation. From a data
mining point of view, applying some definition of episode is a way to preprocess
data according to some temporal principle that is also clinically meaningful.
Besides, it is an opportunity to filter out those irrelevant attributes that will
not be included in data analyses. Episodic mining of health data is also a method
to compress a transactional dataset into a collection of health care episodes,
which are not so diverse due to the nature of services and limited variance in
clinical practice.

We define episode of care as a set of one or more medical services received by 
an individual during a period of relatively continuous contact with one or more 
providers of service, in relation to a particular medical problem or situation. 
Episodes of care should be carefully distinguished from episodes of illness. Care 
episodes focus on health care delivery whereas illness episodes focus on the pa- 
tient experience. Episodes of care are the means through which the health care 
delivery system addresses episodes of illness [8]. Construction of an episode of 
care begins with the first service for a particular condition and ends when there 
are no additional claims for a disease-specific number of days. 

In the database used for our analyses, for 3,617,556 [9] distinct patients only
368,337 unique patient histories were matched. Applying our definition of a
health care episode as a group of tests ordered for a patient by the same doctor
on the same day, which in terms of the database is the content of all records
containing the same Patient Identification Number, the same Referring Provider,
and the same Date of Reference, we represented one of the datasets originally
containing 13,192,395 transactions as a set of 2,145,864 sequences (episodes).
Amongst them only 62,319 sequences were distinct. Our experience in processing
administrative health data has shown that unique health care episodes normally
occupy less than 10% of the total size of data, which makes episode-based
representation an efficient technique for database compression. Thus effective
pruning of the original data is suggested as a starting point in handling
computations on large datasets. Besides that, the obtained knowledge about
diversity and consistency in data is a valuable contribution to understanding the
actual meaning of data. This also contributes to knowledge representation in
general [9].

The patterns of practice derived from administrative health data are a way
to gain some insight into the clinical side of health care services. Medicare
transactions do not contain information about any effects of clinical treatments.
Neither do they contain information about pre-conditions of the treatments or
duration of the disease. Medicare item combinations include various mixes of
consultation, diagnostic and procedural services provided by health providers to
patients for various pathological conditions. Thus, Medicare items and possibly
other relevant attributes associated within one episode could reveal some clinical
side of the event.

One approach to identifying patterns in health data uses association rule
mining [1][7]. The Apriori-like approaches [5] for discovering frequent
associations in data achieve reasonable performance on a variety of datasets, but
for large collections of health records in particular this method is found
limited. Another type of approach emerged from Formal Concept Analysis [10], a
field that focuses on the lattice structures extracted from binary data tables,
or concepts, which provide a theoretical framework for data mining, conceptual
clustering and knowledge representation in general.

In fundamental philosophies, concepts combine things that are different.
Implicit in this is the knowledge that the attributes to which our concepts apply
have many qualities. Thus, implicit in the nature of concepts is recognition that
each referent of a concept differs quantitatively from the others. As in
fundamental philosophies, in Formal Concept Analysis a concept is constituted by
two parts: extension, consisting of all objects belonging to the concept, and
intension, containing all attributes shared by the objects. Such a view of a
concept allows us to detect all structures in a dataset containing complete
information about attributes. All together, these structures present a compact
version of the dataset - a lattice. Building a lattice can be considered a
conceptual clustering technique because it describes a concept hierarchy. In this
context, lattices appear to be a more informative representation compared with
trees, for instance, because they support a multiple inheritance process (e.g.
various types of service by the same type of health care provider).

The classic and one of the most efficient techniques for frequent pattern
mining is the FP-growth algorithm and the like [2]. It is an unsupervised
learning technique for discovering conceptual structures in data. Its benefits are
completeness and compactness, that is, the derived associations contain conclusive
information about data and their number is reduced down to the number of maximal
frequent patterns. However, on a large scale this technique may face memory
problems due to a great FP-tree expansion. We suggest an alternative algorithm
based on splitting the initial records into two or more sub-records, so that none
of the sub-records carries irrelevant information. Such an expanded record is
in fact a mini-structure (or a concept) that already is or will become one of the
components of a formal concept (or Galois lattice) later on.



2 Definition of Formal Concept 



A Galois connection is by nature a characteristic of binary relations that
possess structural and logical properties, and can therefore be used as a tool to
relate structures. A Galois connection defines how one structure abstracts another
when the relation may be presented as a function [10]. Some of the relations
between patterns in the health domain can have functional properties. Health
databases typically have many types of relations (a relation in a database is a
set of instances, where an instance is a vector of attribute values).

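Let us denote a database as $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$,
where $\mathcal{O}$ and $\mathcal{I}$ are the finite sets of objects and items
respectively, and $\mathcal{R}$ is a binary relation
$\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$. A triple
$\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$ is called a formal context,
where elements of $\mathcal{O}$ are objects, those of $\mathcal{I}$ are attributes,
and $\mathcal{R}$ is the incidence of the context.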

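Definition 1 (Galois closure operator).

The Galois closure operator $h = f \circ g$ is the composition of the applications
$f$ and $g$, where $f$ associates with a set of objects $O \subseteq \mathcal{O}$
the items common to all objects $o \in O$, and $g$ associates with an itemset
$I \subseteq \mathcal{I}$ the objects related to all items $i \in I$:

$f: 2^{\mathcal{O}} \to 2^{\mathcal{I}}, \quad f(O) = \{ i \in \mathcal{I} \mid \forall o \in O, (o, i) \in \mathcal{R} \}$

$g: 2^{\mathcal{I}} \to 2^{\mathcal{O}}, \quad g(I) = \{ o \in \mathcal{O} \mid \forall i \in I, (o, i) \in \mathcal{R} \}$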
The following properties hold for all $I, I_1, I_2 \subseteq \mathcal{I}$ and
$O, O_1, O_2 \subseteq \mathcal{O}$:

1) $I_1 \subseteq I_2 \Rightarrow g(I_1) \supseteq g(I_2)$, $\quad O_1 \subseteq O_2 \Rightarrow f(O_1) \supseteq f(O_2)$

2) $O \subseteq g(I) \Leftrightarrow I \subseteq f(O)$

The Galois connection $(f, g)$ has the following properties:

Extension: $I \subseteq h(I)$

Idempotency: $h(h(I)) = h(I)$

Monotonicity: $I_1 \subseteq I_2 \Rightarrow h(I_1) \subseteq h(I_2)$

Definition 2 (Formal concepts).

An itemset $C \subseteq \mathcal{I}$ from $\mathcal{D}$ is a closed itemset iff
$h(C) = C$. The minimal closed itemset containing an itemset $I$ is obtained by
applying $h$ to $I$. $h(I)$ is the closure of $I$. The collection of all closed
itemsets makes a Galois lattice, or formal concept [11] [12].
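
The operators of Definitions 1 and 2 can be illustrated with a minimal Python sketch. The toy context below (three objects over four items) is hypothetical and chosen only for illustration; the relation is stored as a set of (object, item) pairs.

```python
def f(objects, relation, all_items):
    """Items shared by every object in `objects` (f: 2^O -> 2^I)."""
    return frozenset(i for i in all_items
                     if all((o, i) in relation for o in objects))


def g(items, relation, all_objects):
    """Objects related to every item in `items` (g: 2^I -> 2^O)."""
    return frozenset(o for o in all_objects
                     if all((o, i) in relation for i in items))


def closure(items, relation, all_objects, all_items):
    """Galois closure h(I) = f(g(I))."""
    return f(g(items, relation, all_objects), relation, all_items)


# Hypothetical toy context: three objects over four items.
context = {"o1": {"a", "b"}, "o2": {"a", "b", "c"}, "o3": {"c", "d"}}
all_objects = frozenset(context)
all_items = frozenset().union(*context.values())
relation = {(o, i) for o, items in context.items() for i in items}

h_a = closure({"a"}, relation, all_objects, all_items)
h_ab = closure({"a", "b"}, relation, all_objects, all_items)
```

In this toy context the itemset {a} is not closed, because every object containing a also contains b, whereas {a, b} equals its own closure and is therefore a closed itemset.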

Table 1. Dataset WA/2000 (SEp - size of selection of episodes; UniEp - number of
unique episodes; DIt - distinct items)

Table      Size     Episodes     SEp      UniEp      DIt
WA/2000    6.4Gb    3,220,324    223Mb    212,402    942



3 Data Preprocessing 



Table 1 describes a dataset used in our experiments. 

This table represents the group of general practitioners and specialists refer- 
ring both in- and out-hospital patients. 

Our testing platform was a Sun Enterprise with 12 400MHz UltraSPARC
processors, each with 8Mb of external cache, and 7Gb of main memory, running
Solaris 2.6. It has 2 Sun A5100 disk storage arrays (each 250 Gb capacity)
which are fibre channel connected and run in a mirrored configuration. A single
process user job would normally have full use of a single processor.

Let us consider an example of episodic presentation. Take the following set of
transactions, where tid is a transaction ID, pin is a patient identification
number, doc is a doctor ID, dor is a date of referral, dos is a date of service
and svc is a code of a medical service provided [4]. A health care episode will be
defined by the doctor delivering the initial health care services to an
identified patient on the same day.



tid   pin   doc   dor     dos     svc
1     007   23    9555    9557    65007
2     007   23    9555    9558    73009
3     008   104   10111   10112   10900
4     001   53    9118    9123    65007
5     001   53    9118    9127    73009
6     005   99    9173    9174    65007
7     005   99    9173    9174    73009

The first step in preprocessing our small transactional set will be detecting
episodes of care. According to the definition of an episode, we can identify four
health episodes and map them into the following structure.



episode id          content
(007, 23, 9555)     65007 73009
(001, 53, 9118)     65007 73009
(005, 99, 9173)     65007 73009
(008, 104, 10111)   10900



In the table above, three out of four episodes have the same content. It makes 
sense to store such a structure in a hash-table, where one table will be storing 
all the unique episodic contents and the other one will be storing their counts. 



episode          count
(65007, 73009)   3
(10900)          1



Since our purpose is to discover patterns of practice (or concepts) in health
data, the episode-based approach to preprocessing allows us to select and store
only relevant attributes. Storing only unique episodes in a hash-table gives us a
compressed presentation of our original set of transactions [3]. Having presented
the original records in episodic form, we will still treat episodes like records,
or a database.
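
The preprocessing steps of this section can be sketched in Python as follows. This is a minimal illustration, assuming that each transaction is a tuple (tid, pin, doc, dor, dos, svc) and that an episode is keyed by (pin, doc, dor); the function names are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict

# (tid, pin, doc, dor, dos, svc) rows from the example transaction table.
transactions = [
    (1, "007", 23, 9555, 9557, 65007),
    (2, "007", 23, 9555, 9558, 73009),
    (3, "008", 104, 10111, 10112, 10900),
    (4, "001", 53, 9118, 9123, 65007),
    (5, "001", 53, 9118, 9127, 73009),
    (6, "005", 99, 9173, 9174, 65007),
    (7, "005", 99, 9173, 9174, 73009),
]


def build_episodes(rows):
    """Group service codes by (patient, referring doctor, referral date)."""
    episodes = defaultdict(set)
    for _tid, pin, doc, dor, _dos, svc in rows:
        episodes[(pin, doc, dor)].add(svc)
    return episodes


def count_unique(episodes):
    """Map each distinct episode content (as a sorted tuple) to its count."""
    return Counter(tuple(sorted(content)) for content in episodes.values())


episodes = build_episodes(transactions)
episode_counts = count_unique(episodes)
```

The resulting `episode_counts` plays the role of the hash-table of unique episodes: its keys are the distinct episode contents and its values are their counts.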



4 PSAlm: Pattern Splitting Algorithm 

Let us expand our small collection of episodes from the previous example. A
number of items associated with the same object will be an episode. Keeping
our notations, the set of objects is $O = \{o_1, o_2, o_3, o_4, o_5\}$ and the set
of items is $I = \{i_1, i_2, \ldots, i_9\}$, where $i_1 = 104$, $i_2 = 302$,
$i_3 = 304$, $i_4 = 10900$, $i_5 = 55853$, $i_6 = 65007$, $i_7 = 66716$,
$i_8 = 73009$, $i_9 = 73011$ [4].



objects   items                            count
$o_1$     65007, 73009                     3
$o_2$     10900                            1
$o_3$     65007, 66716, 73009, 73011       1
$o_4$     104, 302, 55853                  1
$o_5$     302, 304, 55853, 65007, 66716    1



Let us set a support threshold min_sup = 2. Patterns whose occurrence in the
data equals or exceeds 2 will be considered frequent. In the
technique we propose, the first step is to detect all frequent 2-itemsets.
These are (65007,73009), (302,55853) and (65007,66716). All other 2-itemsets
are infrequent and are therefore irrelevant. This means that $o_3$, for instance,
can be split into sub-patterns (65007,66716,73009) and (65007,66716,73011) since
(73009,73011) is irrelevant, and then again, each of those sub-patterns will have
to be reduced down to (65007,66716) because (65007,73011), (66716,73011)
(and the same operation for the second sub-pattern) are infrequent, therefore
their presence in $o_3$ is unnecessary. Following this principle, we obtain a
reduced version of our collection of episodes:



pattern id   items          count
1            65007, 73009   5
2            302, 55853     2
3            65007, 66716   2



With large collections of episodes, this process will involve further splitting
operations, with 3-itemsets, 4-itemsets and so forth, depending on the lengths of
episodes and their number. In this example, we have detected all frequent
patterns in just one step. The derived patterns (65007,73009), (302,55853) and
(65007,66716) define a formal concept for our dataset. In particular, the binary
relation $(o_1 o_3, i_6 i_8)$ is a concept, $(o_4 o_5, i_2 i_5)$ is a concept as
well, and for instance, $(o_5, i_2 i_5)$ is not a concept since it does not
contain complete information about $i_2 i_5$.
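
The splitting idea can be illustrated with a short Python sketch. It is one plausible reading of the procedure rather than the authors' PSAlm implementation: a record containing an infrequent pair (a, b) is split into the record without a and the record without b, recursively, and only maximal sub-records of at least `min_len` items are kept. The function names and the `min_len` parameter are assumptions made for illustration, and the sketch does not attempt to reproduce the counts of the reduced table.

```python
from itertools import combinations

# Episodes o1..o5 from the example, with their counts.
episodes = {
    frozenset({65007, 73009}): 3,
    frozenset({10900}): 1,
    frozenset({65007, 66716, 73009, 73011}): 1,
    frozenset({104, 302, 55853}): 1,
    frozenset({302, 304, 55853, 65007, 66716}): 1,
}


def frequent_pairs(episodes, min_sup):
    """Pairs of items whose count-weighted occurrence reaches min_sup."""
    counts = {}
    for items, count in episodes.items():
        for pair in combinations(sorted(items), 2):
            counts[pair] = counts.get(pair, 0) + count
    return {pair for pair, c in counts.items() if c >= min_sup}


def split(record, frequent, min_len=2):
    """Recursively split on an infrequent pair; keep maximal sub-records."""
    for a, b in combinations(sorted(record), 2):
        if (a, b) not in frequent:
            parts = (split(record - {a}, frequent, min_len)
                     | split(record - {b}, frequent, min_len))
            # Discard sub-records contained in a larger surviving sub-record.
            return {p for p in parts if not any(p < q for q in parts)}
    return {record} if len(record) >= min_len else set()


freq = frequent_pairs(episodes, min_sup=2)
o3_parts = split(frozenset({65007, 66716, 73009, 73011}), freq)
o5_parts = split(frozenset({302, 304, 55853, 65007, 66716}), freq)
```

Under this reading, every surviving sub-record contains only frequent pairs, so no sub-record carries an irrelevant item combination.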



5 Computational Complexity 



Let us go through some computations using a small database for example. Pre- 
sented in a form of a hash-table, the dataset is as follows: 



episode (key)                                     count (value)
$i_1, i_2, i_3, i_4, i_5, i_6, i_7, i_8$          1
$i_1, i_2, i_3, i_5, i_6$                         1
$i_3, i_4, i_6, i_7, i_8$                         1
$i_5, i_6, i_7, i_8$                              1
$i_7$                                             1
$i_5, i_6, i_8$                                   1
$i_1, i_2, i_3, i_4$                              1
$i_2, i_3, i_4$                                   1



Let us set the minimal support threshold equal to 3. From this table, it is possible
to identify frequent 1-items with linear complexity:

item    $i_3$   $i_6$   $i_2$   $i_4$   $i_5$   $i_8$   $i_1$   $i_7$
count   5       5       4       4       4       4       3       3



The table below lists frequent 2-itemsets. Other 2-itemsets composed of the list
of frequent 1-items are infrequent and will be used further on for splitting our
original records.

2-itemset      count
$i_1, i_2$     3
$i_1, i_3$     3
$i_2, i_3$     4
$i_2, i_4$     3
$i_3, i_4$     4
$i_3, i_6$     3
$i_5, i_6$     4
$i_5, i_8$     3
$i_6, i_7$     3
$i_6, i_8$     4
$i_7, i_8$     3
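
The pair counts can be reproduced with a short Python sketch. It assumes the eight episodes of the hash-table above, each with count 1, and writes item $i_k$ as the integer k; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Episodes from the hash-table above; item i_k is written as the integer k.
records = [
    (1, 2, 3, 4, 5, 6, 7, 8),
    (1, 2, 3, 5, 6),
    (3, 4, 6, 7, 8),
    (5, 6, 7, 8),
    (7,),
    (5, 6, 8),
    (1, 2, 3, 4),
    (2, 3, 4),
]


def pair_counts(records):
    """Count how many records contain each unordered pair of items."""
    counts = Counter()
    for record in records:
        counts.update(combinations(sorted(record), 2))
    return counts


def split_by_support(counts, min_sup):
    """Partition the observed pairs into frequent and infrequent ones."""
    frequent = {p: c for p, c in counts.items() if c >= min_sup}
    infrequent = {p for p, c in counts.items() if c < min_sup}
    return frequent, infrequent


frequent, infrequent = split_by_support(pair_counts(records), min_sup=3)
```

The `infrequent` set is what a splitting step would consult: it can be stored as a dictionary or set so that testing a single pair takes constant time.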



In order to split the original records into two or more sub-records, the PSAlm
algorithm repeatedly applies the following two procedures.

- A record is tested to see whether it contains one or more infrequent pairs.

- If the record does contain such a pair, it undergoes a recursive splitting
procedure.

The complexity of the first procedure depends mainly on the length of a record.
Matching a pair of items from the record against a key in a dictionary of
infrequent pairs takes $O(1)$ time. Let us denote the average record length
as $l^a$. The number of possible pairs of items contained in a record is
$l^a (l^a - 1)/2$. Therefore, checking a record for the presence of any of the
infrequent pairs takes $S_b = l^a (l^a - 1)/2$ steps. For an infrequent pattern
of length $l^p$, $S_b$ is of order $O(m)$. The number of calls to the second
procedure depends on $l^a$ and the number of infrequent patterns contained in the
record. In the worst case, when a record is of length $m$ and all possible
pairs of items in it are infrequent, the algorithm requires at most
$\frac{m!}{2!\,(m-2)!} = \frac{m(m-1)}{2}$ calls to the splitting procedure. This
makes the first procedure of $O(m^2)$ complexity in the worst case.
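
As a minimal sketch of the first procedure, the check below enumerates the pairs of a record and looks each one up in a set of infrequent pairs. The function names and the representation of pairs as sorted tuples are assumptions made for this illustration; the source does not show its own code.

```python
from itertools import combinations


def find_infrequent_pair(record, infrequent_pairs):
    """Return the first infrequent pair contained in the record, or None.

    Each membership test is an O(1) average-time hash lookup, and a record of
    length l produces at most l * (l - 1) / 2 candidate pairs.
    """
    for pair in combinations(sorted(record), 2):
        if pair in infrequent_pairs:
            return pair
    return None


def max_pair_checks(length):
    """Upper bound on the number of lookups for a record of the given length."""
    return length * (length - 1) // 2
```

For a record of length 5, for instance, at most 10 lookups are needed before the record is either accepted as is or handed to the splitting procedure.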

The complexity of the second procedure depends mainly on the length of the
record and, as a result of recursive splitting, on the length of the list of
sub-records ($L^s$). Thus, the complexity of the whole procedure can be estimated
as linear, or $O(L^s)$. $L^s$ depends on $l^a$ and the number of infrequent pairs in
the record, which defines the number of calls to the splitting function. However,
a record cannot be split if the resulting sub-records would have a length less
than $l^p$. Thus, as $l^p$ increases, the potential maximal number of sub-records
decreases. Also, not all the sub-records contain infrequent item combinations;
therefore, only some of them undergo further splitting. Since we are not
interested in sub-records with length less than $l^p$, to roughly evaluate the
potential maximal number of sub-records $L^s$ we could use the approach
suggested in [108]; in our case, such a number could be close to $2^{m - l^p}$.
But this number would hold only if we tried to split the sub-records by
applying all known infrequent patterns at once. Since we apply only one
infrequent pattern at a time, we estimate that $L^s$ may vary from 0 to
$\frac{m!}{l^p!\,(m - l^p)!}$.
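
The recursive splitting can be illustrated with the following sketch. The splitting rule itself is an assumption, since the text does not spell it out: a record containing an infrequent pair (a, b) is replaced by two sub-records, one without a and one without b, and sub-records shorter than a minimum length are dropped. The name `split_record` and the parameter `min_len` (standing in for the current pattern length) are hypothetical.

```python
from itertools import combinations


def split_record(record, infrequent_pairs, min_len=2):
    """Recursively split a record until no sub-record holds an infrequent pair.

    Assumed rule: on finding an infrequent pair (a, b), recurse on the record
    without a and on the record without b. Sub-records shorter than min_len
    are discarded. Returns a set of frozensets.
    """
    record = frozenset(record)
    if len(record) < min_len:
        return set()
    for a, b in combinations(sorted(record), 2):
        if (a, b) in infrequent_pairs:
            return (split_record(record - {a}, infrequent_pairs, min_len)
                    | split_record(record - {b}, infrequent_pairs, min_len))
    # No infrequent pair left: the record is kept as it is.
    return {record}
```

Each call removes one item from each branch, so the recursion always terminates; in the worst case, however, the number of generated sub-records grows quickly, which is why the bound on the list length matters.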

The computations described above are the internal loops of the main computational
procedure, which includes scanning the original (and then modified) dataset
and splitting its records, if necessary. It has two input parameters: a set of
records and a set of infrequent patterns of fixed length. For our example, after
splitting the original records using the set of infrequent 2-itemsets, we have
to obtain a set of infrequent 3-itemsets to continue splitting. This procedure
requires computations of linear complexity, and as a result we obtain a set of
frequent 3-itemsets (below) and an empty set of infrequent 3-itemsets, which
halts the computational process for our example. Otherwise, we would continue
with the set of split records and the newly computed set of infrequent patterns.

3-itemset     count
I1, I2, I3    3
I2, I3, I4    3
I5, I6, I8    3
I6, I7, I8    3



These patterns are supersets of the frequent 2-itemsets; therefore, the resulting
table of maximal frequent patterns is as follows:

concepts      count
I1, I2, I3    3
I2, I3, I4    3
I5, I6, I8    3
I6, I7, I8    3
I3, I6        3



To evaluate the overall complexity of the process of splitting records and updating
the original data, we can consider the following scheme (Figure 1).

Using such a scheme, it is easier to see that the major computational procedure
in PSAlm is of the following complexity:

$$S = k \cdot n \cdot S_b \cdot L^s = k \cdot L^s \cdot n \cdot \frac{l^a (l^a - 1)}{2}$$



It is important to note that $n$ is the number of records in the original dataset.
But at every step, records of length less than the current value of $l^p$ get deleted
from the data. Therefore, $n$ decreases considerably from step to step.

We estimate that the PSAlm algorithm is of $O(k L^s m^2 n)$ complexity in
the worst case, where $k$ is the number of data scans (usually just several) and
$L^s$ is a constant. Thus, we could generalise the algorithm's complexity as
$O(m^2 n)$. In the worst case, when the number of scans gets close to $m$ (in
other words, when the data is extremely dense), the algorithm's performance
will have $O(m^3 n)$ complexity.



The difference in efficiency between FP-growth and PSAlm becomes especially
notable on large data collections (Figure 2) with a high number of attributes
and a low support threshold (less than 1%). The most time-consuming procedure
is the first step of the algorithm, when the original records go through
recursive splitting into two or more sub-records. Since the sub-records are
subject to reduction, the dataset shrinks significantly, and all further
splitting procedures are typically effortless.

Fig. 1. Main computational process for the PSAlm algorithm



[Figure: Computational Time vs MinSup for WA/2000 Data (3,220,324 Episodes); horizontal axis: Minimal Support, %]

Fig. 2. Performance of the Python codes for FP-growth and PSAlm on WA/2000 health
data.



We estimate that the FP-growth algorithm requires $O(2^m n)$ steps in the
worst case, where $m$ is the number of attributes and $n$ is the number of records.
It is a main-memory-based mining technique, and its space consumption may reach
$O(2^m)$. Each path in the FP-tree will be at least partially traversed as many
times as the number of items existing in that tree path ($n$ in the worst case)
times the number of items in the header of the tree ($m$) [2]. Therefore, the
upper bound on the complexity of searching through all paths will be $O(m^2 n)$.
Taking into account that all paths containing an itemset with the item that is
currently the header item have to be traversed, and that it is not possible to
see, before traversing the paths, which itemsets will turn out to be frequent,
the computational complexity rises to $O(2^m n)$ in the final phase of the
algorithm. Trivial examples in the literature on pattern-growth techniques do
not show this behaviour; it can only be seen in large-scale experiments. In
addition, some recursions in counting present an additional computational
barrier. Thus, overall we regard FP-growth's complexity as exponential or close
to it.

6 Conclusions 

Pruning the original data, adequately representing ordinary transactions as
sequences containing specifically selected (or relevant) items, and storing and
analysing only unique sequences are suggested as effective techniques for
dataset compression and knowledge representation.

The suggested pattern-splitting technique is found to be an efficient alternative
to FP-growth and similar approaches, especially when dealing with large
databases.

Formal Concept Analysis (FCA) is an adequate learning approach for discovering
conceptual structures in data. These structures present conceptual hierarchies
that ease the analysis of complex structures and the discovery of regularities
within the dataset. FCA is also a conceptual clustering technique useful
for multipurpose data analysis and knowledge discovery. FCA is found to be a
suitable theory for the determination of the concept of a concept.

Abstract. In this paper, we treat the problem of the grammatical tagging of
non-annotated specialized corpora. Existing taggers are trained on corpora of
general language and give inconsistent results on specialized texts, such as
technical and scientific ones. In order to learn rules adapted to a specialized
field, the usual approach is to label manually a large corpus of this field. This is
extremely time-consuming. We propose here a semi-automatic approach for
tagging specialized corpora. ETIQ, the new tagger we are building, makes it
possible to correct the base of rules obtained by Brill's tagger and to adapt it to
a specialized corpus. The user visualizes an initial, basic tagging and corrects
it either by extending Brill's lexicon or by inserting specialized
lexical and contextual rules. The inserted rules are richer and more flexible than
Brill's. To help the expert in this task, we designed an inductive algorithm
biased by the "correct" knowledge he acquired beforehand. By using techniques
of machine learning and enabling the expert to incorporate knowledge
of the field in an interactive and friendly way, we improve the tagging of
specialized corpora. Our approach has been applied to a corpus of molecular
biology.



1 Introduction 

Knowledge extraction starting from raw, specialized texts is not yet really practical.
In order to step forward on this problem, several related steps must be carried out:
normalization, grammatical tagging, terminology extraction, and conceptual
clustering. Grammatical tagging is a key step in knowledge extraction because its
precision strongly influences the results of the following steps in the chain of
linguistic treatments. Before doing the grammatical tagging, however, it is
necessary to normalize the corpus. The normalization of texts consists of several types
of treatments [1] that are highly dependent upon the specialty and the type of
information to extract. The goal of the normalization is to limit the errors of the
following stage (grammatical tagging) of the text processing and to prepare the
ultimate step of knowledge extraction by reducing the complexity of the vocabulary used.

Once the corpus is standardized, the tagging can be carried out. This step consists
in associating each word with its grammatical tag, according to its morphology and
context. We used the complete tag list of the Penn TreeBank. This list is composed
of 36 part-of-speech tags and 9 others for punctuation. A complete list can be
downloaded at http://www.cis.upenn.edu/~treebank/home.html .

Some tags carry less semantic content than others, either because they are widely
used in all kinds of language, such as the determiners (DT: 'a', 'an', 'the'), or
because they are attached to both very general and very specific words (such as
the adjectives, JJ). Conversely, some tags are associated with highly significant
words, such as the nouns.



Table 1. Examples of PoS (Part-of-Speech) tags containing specialized words.

PoS tag   Signification                      Examples
JJ        adjective                          heterologous, mitotic, molecular
NN        noun, singular or mass             genome, replication, RNA-splicing
NNS       noun, plural                       genomes, RNAs, strains
VBG       verb, gerund/present participle    DNA-damaging, overlapping, oxidizing
RB        adverb                             evolutionarily, post-translationally, structurally



2 Grammatical Taggers 

Among the data-driven algorithms used for tagging, we can cite Inductive Logic 
Programming [2, 3, 4], Instance-Based Learning [5, 6], Transformation-Based 
Learning [7] and statistical approaches [8, 9]. Other sophisticated techniques were 
used, based on the combination of several kinds of taggers. These techniques are 
based on the fact that differently designed taggers produce different errors. Thus, the 
combined tagger shows a higher precision than any of them on its own [10, 11]. 

Rule-Based Supervised Learning (or Transformation Based Learning) 

Brill's tagger [7] uses a rule-based supervised learning algorithm that automatically
detects and corrects its errors, providing a progressive and automatic improvement
of the performance. The tagger functions in two steps. The learning step takes
place only once; it is slow and complex. The second step, the application of the
learned rules, is used very often; it is fast and simple.

It is useful to recall Brill's learning system here, even though many authors
have already used this approach. It works as follows.

The text is annotated by an expert who assigns a grammatical tag to each word of
the text on which the training is carried out. At each iteration, the algorithm selects



672 



A. Amrani, Y. Kodratoff, and O. Matte-Tailliez 



only one rule among all the possible ones; the selected rule is the one that provides
the best tagging improvement of the temporary corpus. The new temporary corpus is
obtained by applying the learned rule to the current temporary corpus. This process is
repeated until no rule can be found with a score higher than a given threshold value.
At the end of the process, an ordered list of rules is obtained. Brill's tagger uses
rule-based supervised learning in two successive modules. The lexical module (first
module) learns lexical (morphological) rules to label the unknown words. In the
contextual module (second module), tags and word morphology of the context are used to
improve the accuracy of tagging.
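
The greedy loop described above can be sketched as follows. This is a deliberately reduced illustration, not Brill's actual implementation: it assumes a single rule template ("change tag A to B when the previous tag is Z"), and the names `apply_rule`, `score`, and `learn_rules` are hypothetical.

```python
def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a list of tags."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # The context is read from the tags as they were before this rule fired.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def score(rule, current, gold):
    """Net improvement: errors fixed minus errors introduced by the rule."""
    new = apply_rule(current, rule)
    fixed = sum(1 for n, c, g in zip(new, current, gold) if n == g and c != g)
    broken = sum(1 for n, c, g in zip(new, current, gold) if n != g and c == g)
    return fixed - broken


def learn_rules(current, gold, threshold=1):
    """Greedily select one best rule per iteration until none reaches the threshold."""
    rules = []
    while True:
        # Candidate rules are proposed from the positions that are still wrong.
        candidates = {(current[i], gold[i], current[i - 1])
                      for i in range(1, len(current)) if current[i] != gold[i]}
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda r: score(r, current, gold))
        if score(best, current, gold) < threshold:
            break
        rules.append(best)
        current = apply_rule(current, best)  # new temporary corpus
    return rules, current
```

Because every selected rule has a strictly positive net score, the number of errors decreases at each iteration and the loop terminates with an ordered list of rules.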



3 Tagging Specialized Corpora 

Whatever the technology on which they are based, current taggers reach a very
satisfactory level; the published results are usually about 98% correct labels, and
even higher figures can be found. These good performances can be explained by the
fact that test corpora are of a type similar to the training corpora. Obviously,
some specific work has to be done in order to adapt them to a specialized corpus.

The present work concerns a molecular biology corpus. This corpus was obtained
by a Medline request with the key words "DNA-binding, proteins, yeast", resulting
in 6119 abstracts (10 megabytes). After having carried out Brill's tagging, we
noted several problems:

- until many iterations are performed, many cleaning mistakes become obvious

- the technical words are unknown (not listed in the general lexicon)

- Brill's rules are not adapted to a specialized corpus.

A possible solution would consist in obtaining an annotated corpus for each
specialty. That would require intensive expert work. This is why we could not
start from an annotated corpus when we chose a specialized domain.

Another limit of the rule-based taggers is that the rule formats are not sufficiently
expressive to take into account the nuances of the specialty language. Here we
introduced into our tagging rules the possibility of using an enriched grammar. The
specialist can define new labels, using logic descriptions and/or regular expressions
to express his knowledge. One of our objectives is to produce methods applicable
whatever the specialized domain, with the least possible manual annotation.



4 Description of ETIQ Software 

The solution we propose is to adapt a tagger, starting from a "general" corpus, to a
specialized corpus. We thus basically preserve Brill's tagger as the starting point of
our system. For the English version, the program was built on the annotated corpus of
the Wall Street Journal, which is of a very different nature than our molecular biology
corpus. We applied the ordered list of Brill's rules to our corpus. Then ETIQ enabled
the expert to display the result of Brill's tagging, to add lexical and contextual rules,
and to supplement Brill's lexicon with a specialized lexicon. To add the Nth rule, the
expert only had to observe the current state of the corpus after the execution of the
(N-1) preceding rules [12] or possibly to look at the results of another tagger.
Specialized rules can be written and carried out in a simple way, which significantly
reduces the time and the effort of the user.

In a preliminary stage, and to adapt a general tagging to a specialized domain, the
expert inserts a specialized lexicon, i.e., a list of specialized words, where each one is
followed by its possible tags. This means that when a word has several possible tags,
such as novel (NN, JJ), one of its tags will be met more often in the corpus. It is also
possible that only a single one will be met.

Our strategy is as follows: we start with Brill's lexical stage, then the user visualizes
the tagging results and corrects them by adding specialized lexical rules. The corpus
is then treated with Brill's contextual stage. Finally, the user visualizes and corrects
Brill's contextual tagging by adding specialized contextual rules.

4.1 Lexical Tagging 

In this module, the goal is to determine the most probable tags of the words unknown
to Brill's lexicon by applying specialized lexical rules. These words are then added to
a specialized lexicon. The ETIQ software helps in correcting the grammatical tagging
while allowing the user to assign new labels. For instance, the word ABSTRACT is
tagged as a proper noun by Brill's tagger because its rules state that words starting
with a capital are proper nouns. Another important feature of ETIQ is that the PoS
tag list can be extended with other tag types, specific to the studied field, for example
the label formula (typed FRM) or citation (CIT). To help the expert identify the
tagging errors, the system shows groups of words (and their tags) on the basis of any
similar morphological feature (for example, words having the same suffix or words
matching the same regular expression). According to the detected errors, the
expert inserts adequate lexical rules to correct these errors. An example of a lexical
rule we had to introduce is: Assign the tag Adjective to the words having the suffix al.
The rules used are more flexible than the ones in Brill's tagger. Brill's tagger assigns
to a word its most probable tag according to simple conditions like the nature of its
prefix, the nature of its suffix, and its contents. The grammar of our rules enables the
combination of the simple conditions used by Brill with regular expressions. The
expert writes precise rules according to given conditions.

The grammar used in ETIQ for the lexical rules is as follows:

<Lexical_rule> ::= if <Sequence_Condition> then <Action>

<Sequence_Condition> ::= <Condition> <Logical_Operator> <Sequence_Condition>
                       | NOT <Sequence_Condition>
                       | ( <Sequence_Condition> )
                       | <Condition>

<Condition> ::= The word has the morphological property m.

<Action> ::= Change the tag A to B | Change the current tag to B

<Logical_Operator> ::= AND | OR

Let us compare two lexical rule examples, respectively from Brill's system and from
ETIQ:

- Brill's rule: JJ ery fhassuf 3 NN. It uses only one morphological condition. This
  means: if the end of the word is ery (ery fhassuf) then the current tag adjective
  (JJ) is changed to noun (NN).

- ETIQ's rule: Nul ( char -gene AND hassuf ions ) NNS. This rule combines two
  morphological conditions. This means: if the word contains -gene (char -gene)
  and its end is ions then the current tag is changed to NNS.
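
To make the combination of conditions concrete, here is a small sketch of how a rule such as the ETIQ example above could be evaluated. It is an illustration only: ETIQ's internals are not described in the text, and the helper names (`char`, `hassuf`, `AND`, `OR`, `NOT`, `apply_lexical_rule`) as well as the sample word are assumptions chosen to mirror the rule syntax.

```python
def char(substring):
    """Condition: the word contains the given character string."""
    return lambda word: substring in word


def hassuf(suffix):
    """Condition: the word ends with the given suffix."""
    return lambda word: word.endswith(suffix)


def AND(left, right):
    return lambda word: left(word) and right(word)


def OR(left, right):
    return lambda word: left(word) or right(word)


def NOT(condition):
    return lambda word: not condition(word)


def apply_lexical_rule(word, tag, condition, new_tag, old_tag=None):
    """Return new_tag if the rule fires, otherwise the unchanged tag.

    old_tag=None corresponds to "change the current tag to B";
    a given old_tag corresponds to "change the tag A to B".
    """
    if (old_tag is None or tag == old_tag) and condition(word):
        return new_tag
    return tag


# Counterpart of: Nul ( char -gene AND hassuf ions ) NNS
rule = AND(char("-gene"), hassuf("ions"))
tag = apply_lexical_rule("multi-gene-deletions", "NN", rule, "NNS")
```

Building conditions as small composable functions is one simple way to obtain the nesting allowed by the grammar; a parser for the textual rule format would sit in front of these helpers.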

The expert visualizes the effect of the rule he just wrote and can modify it before 
saving it. The writing, the modification, and the checking of each rule are done in a 
simple way by using a user-friendly interface. The interface is intuitive, and can be 
used by people who are not computer scientists. 







Fig. 1. ETIQ: Lexical step. On the left, a list of words with the suffix ion. On the right, lexical
rules introduced by the expert. The highlighted line means: set the tag of all words with suffix
ion to NN (noun).

Some of the words are sufficiently specialized to be tagged by a unique tag. For 
example, the word chromosome is always tagged as noun. The expert can freeze these 
tags so that they cannot be changed during the contextual phase. 

If the expert cannot choose a label for a word during this phase, then this word
requires contextual tagging. Contextual rules are applied to words having
several labels in the corpus (thus, different semantic meanings), or to a group of
words (for example, those with the suffix ing) that have different labels according
to the context (NN, JJ, VBG).

4.2 Contextual Tagging

Once the expert has finished the lexical stage, contextual rules can be used to improve 
it. The context of the word is defined by the morphology and tags of the neighboring 
words. In a similar way to the lexical module, the expert can, in an interactive way, 
carry out requests to look for words, according to words, tags, and morphological 
criteria. This enables him to visualize the contexts (the target word and its neighbors) 
and to detect the errors. The user can thus correct these errors by inserting specialized 
contextual rules. The conditions of the rule can be generated automatically from the 
request. The expert can refine it if it is necessary. 

Let us give two examples of words that need contextual rules because they have two
possible tags:

- Possible tags of functions are: NNS (noun, common, plural) and VBZ (verb,
  present tense, 3rd person singular)

- Possible tags of complex are: JJ (adjective) and NN (noun, common, singular)

Compared to Brill's system, which is limited to a context made of either the three
preceding words or the three following words, ETIQ explores a context made of both
the three preceding and the three following words.

Let us give the contextual rules grammar used in ETIQ:

<Contextual_Rule> ::= if <Sequence_Condition> then <Action>

<Sequence_Condition> ::= <Condition> <Logical_Operator> <Sequence_Condition>
                       | NOT <Sequence_Condition>
                       | ( <Sequence_Condition> )
                       | <Condition>

<Action> ::= Change the tag A to B | Change the current tag to B

<Condition> ::= <Context>

<Context> ::= The nth preceding/following word is tagged with t.
            | The nth preceding/following word has the regular expression r.
            | The nth preceding/following word is w.
            | The current word is tagged with t.
            | The current word has the regular expression r.
            | The current word is w.

<Logical_Operator> ::= AND | OR
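
A contextual rule of this kind can be sketched as a predicate over a position in the tagged sentence. The example encodes the rule shown in Fig. 2 (functions tagged NNS, the tag two words before is WRB, the next tag is CC, new tag VBZ); the helper names and the sample sentence are assumptions made for this illustration and do not come from ETIQ itself.

```python
def current_word_is(word):
    """Condition: the current word is w."""
    return lambda words, tags, i: words[i] == word


def tag_at(offset, tag):
    """Condition: the word at position i + offset is tagged with t.

    offset 0 is the current word, negative offsets are preceding words,
    positive offsets are following words.
    """
    def condition(words, tags, i):
        j = i + offset
        return 0 <= j < len(tags) and tags[j] == tag
    return condition


def all_of(*conditions):
    """Conjunction (AND) of several conditions."""
    return lambda words, tags, i: all(c(words, tags, i) for c in conditions)


def apply_contextual_rule(words, tags, condition, new_tag):
    # Conditions are evaluated on the tags as they were before the rule fires.
    return [new_tag if condition(words, tags, i) else tag
            for i, tag in enumerate(tags)]


rule = all_of(current_word_is("functions"), tag_at(0, "NNS"),
              tag_at(-2, "WRB"), tag_at(+1, "CC"))
words = ["how", "it", "functions", "and", "adapts"]
tags = ["WRB", "PRP", "NNS", "CC", "VBZ"]
new_tags = apply_contextual_rule(words, tags, rule, "VBZ")
```

A window of three preceding and three following words, as used by ETIQ, would simply correspond to offsets from -3 to +3 in `tag_at`.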

4.3 Semi-automated Tagging 

Since the tagging task is very cumbersome, most approaches use a supervised learning
technique: texts are annotated by humans (supposedly without mistakes) and tagging
rules are then automatically learned from the annotated corpus. This technique is
possible when dealing with general language, for which annotated corpora already
exist.

Fig. 2. ETIQ: Contextual step. The left side shows all the contexts of the word functions with
the tag NNS (noun, common, plural). The right side shows a contextual rule for the highlighted
functions: if the word is functions with tag NNS and the previous (-2) word tag is WRB
(Wh-adverb) and the next word tag is CC (conjunction, coordinating) then the new tag is VBZ
(verb, present tense, 3rd person singular).

Specialty languages, however, ask for expert tagging, a difficult task due to the
social fact that all experts are overworked. We thus had to develop a tool that speeds
up the tagging task in order to allow experts to tag the texts efficiently and rapidly.
We provide the expert with a user-friendly interface to correct the tagging. The
inductive system takes these improvements into account in order to offer new rules to
him. In the so-called "lexical" phase (during which the context is limited to relations
between the letters inside the word) we use several obvious morphological features
such as prefix and suffix. For these features we use values (like ion for suffix)
suggested by the user or chosen automatically by the system on the basis of their
frequency.

Our basic induction algorithm is the classical C4.5 [13], but its optimization
measure, the gain ratio, has been modified in order to take expert-based biases into
account. To this effect, we introduced an 'expert gain', $Gain_{Exp}$. This gain is
similar to the one computed by C4.5: the attribute $X$ is applied to the data, and we
compute the $Gain_{Exp}$ due to the application of $X$. $Gain_{Exp}$ is not computed on
the whole set of tags but only on the set of tags previously changed by the expert.
Let $T_0$ be the current learning set under study, $T_{Exp0}$ the set of tags that have
been changed by the expert in $T_0$, $N$ the number of instances in the set $T_0$, and
$c$ the number of values of the target feature (tag). Applying the feature $X$, which
has $n$ possible values, to $T_0$ returns the classical C4.5 gain. In order to compute
$Gain_{Exp}$, we apply $X$ to $T_{Exp0}$, thus generating $n$ subsets
$T_{Exp1}, T_{Exp2}, \ldots, T_{Expn}$. Let $N_{ExpPos}(T_{Expi})$ ($i = 0, 1, \ldots, n$)
be the number of instances tagged as the majority class of $T_{Expi}$, and
$N_{ExpNeg}(T_{Expi})$ the number of instances tagged differently from the majority
class of $T_{Expi}$.

Then, we define

$$M_{Exp}(T_{Exp0}) = N_{ExpPos}(T_{Exp0}) - N_{ExpNeg}(T_{Exp0}) \qquad (1)$$

$$M_{ExpX}(T_{Exp0}) = \sum_{i=1}^{n} \left( N_{ExpPos}(T_{Expi}) - N_{ExpNeg}(T_{Expi}) \right) \qquad (2)$$

$$Gain_{Exp}(X) = \frac{\log_2(c) \cdot \left( M_{ExpX}(T_{Exp0}) - M_{Exp}(T_{Exp0}) \right)}{N} \qquad (3)$$

$$\text{Gain Ratio C4.5}_{Exp}(X) = \frac{Gain(X) + Gain_{Exp}(X)}{\text{Split info}(X)} \qquad (4)$$

This gain ratio (Gain Ratio C4.5$_{Exp}$) selects the feature that favors the classical
gain ratio and that contradicts the expert as little as possible. We chose the sum
of the C4.5 gain ($Gain(X)$) and $Gain_{Exp}$ in order to come back to the classical gain
ratio when the expert has provided no advice. This algorithm proposes new rules to the
expert, who presently has to choose the best ones. A partial automation of this
selection step is presently under study.
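
The following sketch shows one way the expert-biased gain ratio could be computed. It is an illustration under explicit assumptions: the normalization of the expert gain by $N$ follows the reading of formula (3) used here, the classical gain and split information are the usual entropy-based C4.5 quantities, and every function name as well as the `expert_idx` argument (indices of the instances whose tag the expert changed) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (base 2) of a list of labels or values."""
    n = len(items)
    return -sum((k / n) * math.log2(k / n) for k in Counter(items).values())


def partition(values, labels):
    """Group the labels by the value of feature X."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return list(groups.values())


def gain(values, labels):
    """Classical information gain of feature X on the whole learning set."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in partition(values, labels))


def margin(labels):
    """N_ExpPos - N_ExpNeg: majority-class count minus all other instances."""
    if not labels:
        return 0
    majority = max(Counter(labels).values())
    return majority - (len(labels) - majority)


def expert_gain(values, labels, expert_idx, n_classes):
    """Assumed form of formula (3): log2(c) * (M_ExpX - M_Exp) / N."""
    exp_values = [values[i] for i in expert_idx]
    exp_labels = [labels[i] for i in expert_idx]
    m_exp = margin(exp_labels)
    m_exp_x = sum(margin(g) for g in partition(exp_values, exp_labels))
    return math.log2(n_classes) * (m_exp_x - m_exp) / len(labels)


def gain_ratio_exp(values, labels, expert_idx, n_classes):
    """Formula (4): (Gain(X) + Gain_Exp(X)) / Split info(X)."""
    split_info = entropy(values)
    if split_info == 0:  # X takes a single value: no usable split
        return 0.0
    return (gain(values, labels)
            + expert_gain(values, labels, expert_idx, n_classes)) / split_info
```

With an empty `expert_idx` the expert term vanishes and the function reduces to the classical gain ratio, which is the behaviour the text asks for when the expert has given no advice.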



4.4 Towards a Perfectly Tagged Corpus 

One of the expectations of the system is the semi-automatic acquisition of a perfectly 
tagged corpus of specialty. On one hand, ETIQ makes it possible to correct easily and 
manually the few errors introduced by the side effects of the rules. On the other, the 
expert can manually impose correct labels missed by the rules. All that enabled us to 
obtain a "perfect" biology sub-corpus of one megabyte with a minimal involvement 
(half-a-day) of the expert. 



5 Experimental Validation 

In order to validate our approach, we used a sub-corpus of 600 abstracts.

We generated three tagged corpora starting from our raw corpus, $C_0$:

- $C_0$ annotated (supposedly perfectly), leading to the reference ("perfect") corpus;
- $C_0$ tagged by the standard Brill's tagger, leading to $C_{Brill}$;
- $C_0$ tagged using ETIQ, leading to $C_{ETIQ}$, as follows:

$C_0$ was tagged by Brill's lexical module, yielding $C_1$. The lexical tagging of
$C_1$ was then improved by ETIQ, yielding $C_2$. $C_2$ was tagged by Brill's contextual
module, yielding $C_3$. Finally, the contextual tagging of $C_3$ was improved using
ETIQ, yielding $C_{ETIQ}$.

Table 2. Tagging results.

PoS tag  # Tags  # errors  Precision   Recall  # errors  Precision   Recall
                  C_Brill    C_Brill  C_Brill    C_ETIQ     C_ETIQ   C_ETIQ
CC         4347        16     100.00    99.63         5      99.93    99.88
CD         1787       305      65.84    82.93        46      95.45    97.43
DT        11869        56      99.30    99.53        47      99.29    99.60
EX           35         2     100.00    94.29         2     100.00    94.29
FW           57        12      81.82    78.95        15      95.45    73.68
FRM        6417      6417       0.00     0.00       190      97.86    97.04
IN        16559         3      99.54    99.98         3      99.67    99.98
JJ        11044      1016      81.11    90.80       458      98.33    95.85
JJR         116         8      93.10    93.10         7      92.37    93.97
JJS          69         0      90.79   100.00         0      98.57   100.00
MD          490         2     100.00    99.59         3     100.00    99.39
NN        29081      4995      97.42    82.82      2220      97.25    92.37
NNS        7618       153      94.71    97.99        26      98.50    99.66
NNP        4116       239      29.74    94.19       230      66.23    94.41
NNPS          3         0       3.61   100.00         3       0.00     0.00
PDT          12         3     100.00    75.00         0      85.71   100.00
POS          43        39      12.90     9.30         0      61.43   100.00
PRP        1229         0      99.76   100.00         0      99.76   100.00
PRP$        486         0     100.00   100.00         0     100.00   100.00
RB         3555       333      92.69    90.63        33      99.55    99.07
RBR          68        11     100.00    83.82        12     100.00    82.35
RBS          49         6     100.00    87.76         0     100.00   100.00
RP           17        12     100.00    29.41         5      80.00    70.59
SYM         117        66     100.00    43.59        22     100.00    81.20
TO         2242         0     100.00   100.00         0     100.00   100.00
VB         1980        45      93.21    97.73        63      96.62    96.82
VBD        1851        14      94.89    99.24        14      99.19    99.24
VBG        2395       243      90.00    89.85        72      91.93    96.99
VBN        4136        94      99.04    97.73        16      99.66    99.61
VBP        2272        89      99.05    96.08        85      98.20    96.26
VBZ        3518       120      98.87    96.59        70      99.54    98.01
WDT         914         3      99.89    99.67         0      99.78   100.00
WP            5         0     100.00   100.00         0     100.00   100.00
WP$          22         0     100.00   100.00         0     100.00   100.00
WRB         298       126     100.00    57.72         1      99.33    99.66
           6011         0      99.90   100.00         0      99.90   100.00
"            67        67       0.00     0.00        67       0.00     0.00
(          1634         0     100.00   100.00         0     100.00   100.00
)          1637         0     100.00   100.00         0     100.00   100.00
           5101         0     100.00   100.00         0     100.00   100.00
-             9         9       0.00     0.00         0      37.50   100.00
'            27        27       0.00     0.00        27       0.00     0.00
            272         6      91.72    97.79        94     100.00    65.44

Note that we mainly improved the tagging of the adjectives (JJ), of the nouns (NN),
and of some of the verbs (VBG and VBN), which are of primary importance for defining
a relevant and domain-specific terminology.

Our results can be summarized as follows:

Normalized mean (i.e., a mean weighted by the number of tags of each value)
precision of $C_{Brill}$: 89.26; normalized mean recall of $C_{Brill}$: 89.12
Normalized mean precision of $C_{ETIQ}$: 97.48; normalized mean recall of $C_{ETIQ}$: 97.12
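
The weighted ("normalized") means reported above can be computed along the following lines. This is a sketch under assumptions: precision and recall are taken in their usual per-tag sense against the reference corpus, the weight of a tag is its number of occurrences in the reference corpus, and the function names are hypothetical.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (count in gold, precision, recall)} for every gold tag."""
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    scores = {}
    for tag, n_gold in gold_count.items():
        # Precision is set to 0 when the tagger never proposes this tag.
        precision = correct[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = correct[tag] / n_gold
        scores[tag] = (n_gold, precision, recall)
    return scores


def weighted_means(scores):
    """Means of precision and recall, weighted by the number of gold tags."""
    total = sum(n for n, _, _ in scores.values())
    mean_precision = sum(n * p for n, p, _ in scores.values()) / total
    mean_recall = sum(n * r for n, _, r in scores.values()) / total
    return mean_precision, mean_recall
```

Weighting by tag frequency means that very frequent tags such as NN dominate the summary figure, while rare tags with poor scores have little influence on it.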



6 Similar Approaches 

Let us now discuss two similar approaches. ANNOTATE is an interactive tool for
semi-automatic tagging [14], similar to ours. This system interacts with a statistical
tagger [9] and a "parser" to accelerate the tagging. Grammatical tagging works as
follows: the tagger determines the tag of each word on the basis of the tag frequency
probability; the system then displays the sentences and calls for confirmation or
correction of the unreliable labels. In this system, the annotator does not write rules.
The internal model is updated gradually with the confirmations and the corrections
carried out. These changes are thus exploited immediately.

KCAT [15] is a semi-automatic annotation tool for building a precise and consistent
tagged corpus. It combines the use of a statistical tagger and lexical rules of
clarification acquired by an expert. The rules used are limited to word similarity. The
method enables the correction of the words whose labels are not reliable. The expert
does not have to repeat the same action on two words in the same context because
lexical rules are generated automatically. The expert tags words by lexical rules, the
remaining words are tagged by the statistical tagger, and the expert then has to correct
the unreliable tags directly. These rules are very specific. In order to obtain a
well-tagged corpus, the expert must insert a large quantity of tags. This increases human
costs and slows down the speed of execution.

In ETIQ, we start with a non-annotated specialized corpus, then we tag the corpus
with a general tagger, and, subsequently, the expert inserts the most general possible
rules with the help of an inductive module. Finally, when the expert judges that the
corpus is correctly tagged, he has the possibility to obtain a perfectly tagged corpus
by directly (without rules) correcting the wrong labels.



7 Conclusion and Future Work 

Our ultimate goal is domain- specific information extraction, obtained by a complete 
text-processing chain. Tagging conditions the obtention of a relevant terminology in a 
given speciality domain which is a key preliminary stage [17] to information 
extraction. This explains why we give so much importance to the tagging task. 

In this paper, we introduced a semi-automatic, user-friendly tagger (ETIQ) enabling
an expert to deal with non-annotated specialty corpora. Using ETIQ, the expert
can easily display the result of a basic tagging (carried out by Brill’s tagger). On this
basis, the user easily corrects the errors by inserting lexical rules and, subsequently,
contextual rules. In the rule-writing task, the expert is assisted by an inductive
module; this module proposes rules learned on the basis of those written by the expert.
To confirm the assumptions upon which our system is built, we carried out an
experimental validation: we compared the labels obtained by our system with those
obtained by Brill’s tagger and those of a ‘perfect’ corpus.

Progressive induction starting from the behavior of the expert is an approach that is
not commonly used, but our experience is that the expert finds the proposals of the
machine acceptable. At present, we have tested this approach only for the lexical rules.

Induction at the contextual stage, which is relational, leads very quickly to a
combinatorial explosion. We will use progressive induction to solve this problem. The
contextual rules come directly from the detection of the erroneous contexts. We will
adapt and incorporate (semi-)automatic methods for error detection [16]. These
methods will enable the expert to target the contexts where errors are more likely, thus
helping him/her to organize and accelerate his/her work.

Our goal is to use "extensional induction" on the various tags, i.e., learning in
extension what a tag is. This extension is extremely large, since it comprises a
list of all the strings of letters qualifying for this tag, and of all the contexts that
disambiguate such a string when it can also belong to other tags.

Abstract. Computational diagnosis of cancer is a classification problem, and it
places two special requirements on a learning algorithm: perfect accuracy and a small
number of features used in the classifier. This paper presents our results on an
ovarian cancer data set. This data set is described by 15154 features and consists
of 253 samples. Each sample comes from a woman who either suffers from ovarian
cancer or does not. The raw data are generated by mass spectrometry
technology, which measures the intensities of 15154 protein or
peptide features in a blood sample from every woman. The purpose is to identify
a small subset of the features that can be used as biomarkers to separate the two
classes of samples with high accuracy. The identified features can then
potentially be used in routine clinical diagnosis to replace labour-intensive
and expensive conventional diagnosis methods. Our new tree-based method
achieves perfect 100% accuracy in 10-fold cross-validation on this data set,
and it also directly outputs a small set of biomarkers. We then
explain why support vector machines, naive Bayes, and k-nearest neighbour
cannot fulfill this purpose. This study also aims to elucidate the communication
between contemporary cancer research and data mining techniques.

Keywords: Decision trees, committee method, ovarian cancer, biomarkers,
classification.



1 Introduction 

Contemporary cancer research has distinguished itself from traditional research by its
unprecedentedly large amounts of data and its tremendous diagnostic and therapeutic
innovations. Data are currently generated in high-throughput fashion. Microarray-based gene
expression profiling technologies and mass spectrometry instruments are two types of
such methods in common use [10,18,15,17]. Microarray technologies can measure the
expression levels (real values) of tens of thousands of genes in a human cell [10,18],
while mass spectrometry instruments can measure the intensities (also real values) of tens
or even hundreds of thousands of proteins in a blood sample of a patient [15,17]. As
hundreds of cells or blood samples are usually involved in a biomedical study, the resulting
data are very complex. In other words, if every gene or protein is considered as a
variable, those hundreds of samples are points located in a very high-dimensional Euclidean
space with an extremely sparse distribution. These data points are usually labelled with
classes, e.g. ‘tumor’ versus ‘normal’. These labelled points can also be converted into a
very wide relational table. Data of this kind provide both opportunities and challenges
for the data mining field due to their high dimensionality and relatively low
volume.

In this paper, we focus on a data set derived from ovarian cancer patients [15]. This
disease is the leading cause of death for women who suffer from cancer. Most treatments
for advanced ovarian cancer have limited efficacy, and the resulting 5-year survival rate is
just 35%-40%. That is, only a small proportion of advanced ovarian cancer patients
survive 5 years after treatment. By contrast, if ovarian cancer is detected while it
is still at an early stage (stage 1), conventional therapy produces a 95% 5-year survival
rate [17]. So, early detection (correctly classifying whether a patient has this disease or
not) is critical for doctors to save the lives of many more patients. The problem is how
to make an effective early diagnosis in a simple way. Mathematically, the problem
is how to find a decision algorithm that uses only a small subset of features but can
separate the two classes of samples completely.

Petricoin et al. [15] first attempted to use mass spectrometry as a basis for identifying
biomarkers from blood samples for early detection of this disease. Mass spectrometry
is also called proteomic profiling, meaning the measurement of the mass intensities of all
possible proteins or peptides in blood samples. Body fluids such as blood are protein-rich
information reservoirs that contain traces of what the blood has encountered during its
circulation through the body. So, a group of ovarian cancer patients should exhibit some
distinctive proteomic patterns in their blood samples, compared to those of healthy patients.
Petricoin and his colleagues [15] collected 253 blood samples from 91 different healthy
patients and 162 different diseased patients. For each blood sample, they measured
the intensities of the same 15154 variables with a so-called PBS-II mass
spectrometry instrument. They hoped that some features would change their intensities
significantly between the healthy and tumor cases. Petricoin et al. [15] divided
the whole data set into a training subset and a test set, and used a genetic computational
algorithm to find a subset of features from the training data. Then the identified features
were applied to the test set to obtain the accuracy and other evaluation indexes
such as sensitivity, specificity, and precision. Their results are reported to be perfect: no
misclassifications were made on their test data by their method [15].

However, from a data mining point of view, the above computational data analysis is
weak. First, 10-fold (or other numbers of folds) cross-validation is commonly used to
get an overall reliable accuracy for a learning algorithm on a data set, but Petricoin
and his colleagues evaluated their algorithm on only one small set of pre-reserved test
samples. Second, there are many long-studied, well-verified learning algorithms in the
field, such as decision trees [16], support vector machines (SVMs) [3,4], Naive Bayes
(NB) [7,11], and k-nearest neighbour (k-NN) [5]. These algorithms should also be
applied to the data set to see whether their performance is perfect and whether they
agree with each other, and to see whether there indeed exists a small subset of biomarkers
in the blood samples of ovarian cancer patients. So, to strengthen the results, in this paper
we conduct the following two aspects of data analysis: (i) using decision trees (Bagging [1]
and Boosting [9]), SVM, NB, and k-NN to get 10-fold (also 7-fold and 5-fold)
cross-validation performance on this data set; (ii) using a new decision-tree based committee
method [12,13] to find biomarkers from the data set.

Compared with the non-linear learning algorithms (SVMs, NB, and k-NN), decision tree based
algorithms have advantages for finding a small subset of biomarkers. The induction of a
decision tree from a training data set is a recursive learning process [16]. In each iteration,
the algorithm selects a most discriminatory feature and splits the data set into exclusive
partitions. If all partitions contain a pure or almost pure class of samples, the process stops.
Otherwise, the algorithm is applied again to the impure partitions. Such a recursive process
can terminate quickly. Therefore, even if a large number of features are input to a
tree induction algorithm, the resulting classifier (the tree) may contain just a couple of
features. That is, the classification of any new test sample needs only the features
in the tree, rather than all the original features. By contrast, for the non-linear learning
algorithms, the classifiers have to use the values of all original features in the test phase.
Decision tree based algorithms are therefore a more systematic approach to
finding a small subset of biomarkers from high-dimensional biomedical data.
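
The recursive induction process can be illustrated with a minimal sketch. It is not
C4.5: as a simplifying assumption, each split is chosen to minimise the number of
misclassified training samples rather than by an entropy-based criterion, and all
function names are hypothetical.

```python
from collections import Counter


def best_split(rows, labels):
    """Return (feature, threshold) with the fewest misclassified samples, or None."""
    best = None
    for f in range(len(rows[0])):
        # Every distinct value except the largest leaves both sides non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # Errors made if each side predicts its own majority class.
            err = sum(len(s) - Counter(s).most_common(1)[0][1] for s in (left, right))
            if best is None or err < best[0]:
                best = (err, f, t)
    return None if best is None else best[1:]


def induce(rows, labels):
    """Recursively grow a tree; leaves are class labels, inner nodes are tuples."""
    if len(set(labels)) == 1:
        return labels[0]  # pure partition: stop
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # no feature can split further
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            induce([r for r, _ in left], [y for _, y in left]),
            induce([r for r, _ in right], [y for _, y in right]))


def used_features(tree):
    """Collect the feature indices that appear at non-leaf nodes."""
    if not isinstance(tree, tuple):
        return set()
    f, _, left, right = tree
    return {f} | used_features(left) | used_features(right)
```

On data where one feature separates the classes, `used_features(induce(rows, labels))`
contains that single index, however many other features each row carries.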

The proposal of our new committee method is motivated by the following three
observations: (i) single decision trees do not have high performance; (ii) there exist many
diversified decision trees in a high-dimensional data set that have similar performance;
(iii) voting by a committee has much potential to eliminate errors made by individual
trees. This new committee method has performance comparable to the best of
the other classifiers. It also provides small sets of biomarkers that can be
derived directly from the trees. This new committee method is different from Bagging [1],
AdaBoosting [9], randomized trees [6], and random forest [2].

The remainder of the paper is organized as follows. We give a detailed description
of the data set in Section 2. In Section 3, we promote the use of decision tree-based
algorithms for finding biomarkers from high-dimensional data, and introduce a new
committee method with technical details. In Section 4, we briefly describe three non-linear
learning algorithms (SVM, NB, and k-nearest neighbour) and explain why these
algorithms cannot fulfill some important biomedical research purposes. In Sections 5 and
6, we present 10-fold cross-validation results on the ovarian cancer data set:
SVM and our new committee method both achieve perfect 100% accuracy. We
also report the biomarkers identified by our new method from the 253 blood samples.
In Section 7, we discuss the difference between our new committee method and
state-of-the-art committee algorithms. Finally, we conclude this paper with a summary.

2 Data Set Description 

The whole data set consists of two classes of samples: 91 ‘Normal’ samples and 162
‘Cancer’ samples [15]. A ‘Normal’ sample refers to a blood (more precisely, serum)
sample taken from a healthy patient, while a ‘Cancer’ sample refers to a blood sample
taken from an ovarian cancer patient. Each sample is described by 15154 features.
Each feature is a protein or peptide (more precisely, a molecular mass/charge identity)
in the blood sample, taking continuous real values equal to the intensity of this variable in
a particular patient’s blood. If a sample is represented by a vector $(p_1, p_2, \ldots, p_n)$
with $n = 15154$, where $p_j$, $1 \le j \le 15154$, is the intensity of the $j$th feature,
then there are in total 253 (91 plus 162) such long vectors in this data set.

The raw data are available at a public website,
http://clinicalproteomics.steem.com/download-ovar.php (as of November 12, 2003).
As the raw data do not match the common format used in
data mining and machine learning, we have performed some pre-processing and
transformed the raw data into the .data and .names format.
The transformed data are stored at our Kent Ridge Biomedical Dataset Repository
(http://sdmc.i2r.a-star.edu.sg/rp/).



3 A New Decision Tree Based Committee Method for Classification 

In this section, we first show that decision trees are a good and systematic approach to 
finding a small subset of biomarkers. Second, we use an example to show single decision 
trees are not always accurate in test phase though possessing perfect accuracy on training 
data. To strengthen the performance, we introduce a new committee method to combine 
single trees for reliable and accurate classification on test samples. 



3.1 Decision Tree: An Ideal Algorithm for Finding Biomarkers 



We use a different high-dimensional biomedical data set, a gene expression profiling
data set, for demonstration. The reason is that the ovarian data set does not have
independent test data, whereas this gene expression data set does. This gene expression
data set consists of 215 two-class training samples for differentiating between 14
patients with the MLL (mixed-lineage leukemia) subtype of childhood leukemia and 201
patients suffering from any other subtype of this disease [18]. The data are described by 12558
features. Here each feature is a gene, having continuous expression values. A decision
tree is constructed from these data by C4.5 [16]. The structure of this tree is depicted in
Figure 1. Observe that there are only 4 features in this tree, residing at the 4 non-leaf
nodes, namely 34306_at, 40033_at, 33244_at, and 35604_at. Compared to the total of 12558
input features, the 4 features built into the tree are a tiny fraction. For interpretation,
each of the five leaf nodes corresponds to a rule whose predictive term is the class label
contained in the leaf node.

So, this tree can be decomposed into 5 rules, and these rules can be written as a function
as follows.

$$
f(x_1, x_2, x_3, x_4) =
\begin{cases}
-1 & \text{if } x_1 \le a,\ x_2 \le b \\
-1 & \text{if } x_1 \le a,\ x_2 > b,\ x_3 \le c \\
1 & \text{if } x_1 \le a,\ x_2 > b,\ x_3 > c \\
1 & \text{if } x_1 > a,\ x_4 \le d \\
-1 & \text{if } x_1 > a,\ x_4 > d
\end{cases}
$$

where $x_1$, $x_2$, $x_3$, and $x_4$ represent gene variables 34306_at, 40033_at, 33244_at, and
35604_at, respectively; $a = 13683.6$, $b = 3691.4$, $c = 986.9$, $d = 846.6$; the two values
of this function, $-1$ and $1$, represent others and MLL respectively. So, given a test sample,
at most 3 of the 4 genes’ expression values are needed to determine $f(x_1, x_2, x_3, x_4)$.
If the function value is $-1$, then the test sample is predicted as others; otherwise it
is predicted as MLL. These 4 genes are biomarkers: simple and effective! We do not
necessarily need all 12558 original features to make decisions. As seen later, a
similarly small number of protein biomarkers also exists in the ovarian cancer data set and
can be discovered by decision tree based methods.

Fig. 1. A decision tree induced by C4.5 from the gene expression data set for differentiating the
subtype MLL against other subtypes of childhood leukemia. Here a = 13683.6, b = 3691.4, c =
986.9, d = 846.6.
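
The five rules of this tree can be written directly as code. The sketch below uses the
thresholds reported for Fig. 1; the function name is hypothetical, and it assumes that
values equal to a threshold follow the lower branch, as is usual for C4.5 splits.

```python
# Thresholds reported for the C4.5 tree in Fig. 1.
A, B, C, D = 13683.6, 3691.4, 986.9, 846.6


def classify_mll(x1, x2, x3, x4):
    """Return 1 (MLL) or -1 (others) from four gene expression values.

    x1..x4 stand for 34306_at, 40033_at, 33244_at and 35604_at.
    Assumption: a value equal to a threshold takes the "<=" branch.
    """
    if x1 <= A:
        if x2 <= B:
            return -1                  # rule 1
        return -1 if x3 <= C else 1    # rules 2 and 3
    return 1 if x4 <= D else -1        # rules 4 and 5
```

Each call follows one root-to-leaf path, so it reads at most three of the four values.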

3.2 An Introduction to a New Committee Method 

Single decision trees usually do not provide good accuracy on test data, especially
when handling high-dimensional biomedical data such as gene expression profiling data
or proteomic profiling data [14]. For example, on the data set discussed in the previous
section, the tree made 4 mistakes on 112 test samples. A possible reason is that the greedy
search heuristic confines the capability of the tree induction algorithm, only allowing
the algorithm to learn well on one aspect of the high-dimensional data.

Next, we introduce our new committee method. It is called CS4* [13,12], and its 
basic idea to discover a committee of trees is to change root nodes using different 
top-ranked features. For example, to discover 10 trees, we use 10 top-ranked features 
each as the root node of a tree. Such trees are called cascading trees [12]. This method 
was motivated by the idea of ‘second-could-be-the-best’, and it is confirmed by the 
observation that there exist many outstanding features in high-dimensional data that 
possess similar classification merits with little difference. 
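
Choosing the root features for a committee can be sketched as follows. The ranking
criterion here, the best information gain over binary splits of each feature, is an
assumption made for illustration, and the function names are hypothetical; the i-th
returned feature would serve as the root of the i-th cascading tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_gain(values, labels):
    """Largest information gain over all binary splits 'value <= t' of one feature."""
    base, n, best = entropy(labels), len(labels), 0.0
    for t in sorted(set(values))[:-1]:  # both sides of each split are non-empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best


def top_ranked_roots(rows, labels, k):
    """Indices of the k features with the highest gain, best first."""
    gains = [best_gain([r[f] for r in rows], labels) for f in range(len(rows[0]))]
    return sorted(range(len(gains)), key=lambda f: -gains[f])[:k]
```

With k = 10, the returned indices would give the ten root nodes used in the example
above.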

Table 1 summarizes the training and test performance, and the numbers of features
used, for 10 cascading trees induced from the MLL-others data set. Here, the $i$th
($1 \le i \le 10$) tree is established using the $i$th top-ranked feature as root node. Observe that:
(a) the 10 trees made similar numbers of errors on the training and test data; (b) the 5th,
8th, or the 9th tree made a smaller number of errors than the first tree made; (c) the rules
in these trees were very simple, containing about 2, 3 or 4 features; (d) none of them
produced a perfect test accuracy.

Next, we describe how our CS4 method combines these individual trees and how CS4
eliminates those errors. The general method is as follows [13]. Suppose $k$ trees are
induced from a data set consisting of only positive and negative training samples. Given
a test sample $T$, each of the $k$ trees in the committee will have a specific rule that gives a
predicted class label for this test sample. Denote the $k$ rules from the tree committee as
$rule_1^{pos}, rule_2^{pos}, \ldots, rule_{k_1}^{pos}$ and $rule_1^{neg}, rule_2^{neg}, \ldots, rule_{k_2}^{neg}$. Here $k_1 + k_2 = k$.

* CS4 is an acronym of Cascading-and-Sharing for ensembles of decision trees.

Table 1. The performance of 10 cascading trees on the MLL-others data set, which consists of 215
training and 112 test samples.

Tree No.          1   2   3   4   5   6   7   8   9   10
Training errors   0   0   0   0   0   0   1   0   0   0
Test errors       4   3   2   3   2   6   7   2   1   6
# of features     4   4   4   4   4   4   4   4   4   6



Each $rule_i^{pos}$ ($1 \le i \le k_1$) predicts $T$ to be in the positive class, while each $rule_i^{neg}$
($1 \le i \le k_2$) predicts $T$ to be in the negative class. Sometimes the $k$ predictions are
unanimous, i.e., either $k_1 = 0$ or $k_2 = 0$. In these situations, the predictions from all
the $k$ rules agree with one another, and the final decision is obvious and appears reliable.
Often, the $k$ decisions are mixed, with either a majority of positive classes or a majority
of negative classes. In these situations, we use the following formulas to calculate two
classification scores based on the coverages of these rules, i.e., the percentage of a class's
samples that satisfy these rules:

$$
Score^{pos}(T) = \sum_{i=1}^{k_1} coverage(T, rule_i^{pos}), \qquad
Score^{neg}(T) = \sum_{i=1}^{k_2} coverage(T, rule_i^{neg}).
$$

If $Score^{pos}(T)$ is larger than $Score^{neg}(T)$, we assign the positive class to the test
sample $T$. Otherwise, $T$ is predicted as negative. This weighting method allows the tree
committee to automatically distinguish the contributions of the trivial rules from those of
the significant rules in the prediction process.
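
The score comparison can be sketched in a few lines. The function name and the
representation of a fired rule as a (class, coverage) pair are assumptions made for
illustration; coverage values would come from the training data.

```python
def cs4_predict(fired_rules):
    """Combine the k rules that fire on a test sample T, one per tree.

    fired_rules: list of (predicted_class, coverage) pairs, where
    predicted_class is "pos" or "neg" and coverage is the fraction of
    training samples of that class satisfied by the rule.
    """
    score_pos = sum(cov for cls, cov in fired_rules if cls == "pos")
    score_neg = sum(cov for cls, cov in fired_rules if cls == "neg")
    # Anything other than a strictly larger positive score is predicted negative.
    return "pos" if score_pos > score_neg else "neg"
```

For instance, one positive rule with coverage 0.9 outweighs two negative rules with
coverages 0.2 and 0.3, even though the negative rules form the majority of votes.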

Let’s examine the performance of combined trees on the MLL-others data set. When
combining the first 2 trees, the committee made 3 mistakes, one fewer than the
first tree alone. When combining the first 4 trees, the committee did not make
any errors on the 112 test samples, eliminating all the mistakes made by the first tree.
As more trees were added to the committee, up to the 10th tree, the expanding committee
maintained the perfect test accuracy.

From this example, we can see that the use of multiple trees can greatly improve the
performance of single decision trees. In addition to this perfect accuracy, the committee
of trees also provides a small subset of biomarkers. In fact, each decision tree contains 3
or 4 features, so the committee contains 30 to 40 features in total. Compared to the original
12558 features, the number of features contained in the committee is very small. As seen
later, the CS4 learning algorithm can also achieve the two goals for the ovarian cancer data set.



4 Three Non-linear Learning Algorithms 

We have seen in the last section that tree-based methods are a systematic approach
to finding a small subset of features that can be used as biomarkers. In this section,
we explain why k-nearest neighbour, support vector machines, and naive Bayes cannot
fulfill this purpose, and we also explain why genetic algorithms are not the most relevant
for this purpose.

The k-nearest neighbour [5] classifier is an instance-based algorithm; it does not
need a learning phase. The simple intuition behind classification is that the class label
of a new sample X should agree with the majority of the k nearest points of X when the
training data set is viewed as a set of points in a high-dimensional Euclidean space. So, this
classifier can only make a prediction; it cannot derive any biomarkers because it always
depends on all the input features for the calculation of distance.
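
A minimal sketch of this classifier makes the dependence on every feature visible.
The function name is hypothetical, and ties between equally distant points are
resolved by training order, which is an assumption of this sketch.

```python
from collections import Counter
from math import dist


def knn_predict(train_rows, train_labels, x, k=3):
    """Majority label among the k training points closest to x (Euclidean distance)."""
    # The distance runs over every coordinate, so no feature subset emerges.
    order = sorted(range(len(train_rows)), key=lambda i: dist(train_rows[i], x))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

There is no training step: the whole training set is kept and scanned for each new
sample.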

The support vector machine is a statistical learning algorithm, also called a
kernel-based learning algorithm [3,4]. Consider a two-class training data set having $n$
original features $x_1, x_2, \ldots, x_n$; a classifier induced from this training data by support
vector machines is a function defined by:

$$
f(x_1, \ldots, x_n) =
\begin{cases}
1 & \text{if } \sum_{i \in I} a_i K(X, Y_i) + b > 0 \\
-1 & \text{otherwise}
\end{cases}
$$

where $X, Y_i \in R^n$ and $a_i, b \in R$; $a_i$, $Y_i$, $I$ and $b$ are parameters, and $X$ is the
sample to be classified as 1 (normal) or $-1$ (abnormal). The SVM training process
determines the entire parameter set $\{a_i, Y_i, I, b\}$; the resulting $Y_i$, $i \in I$, are a subset of
the training set and are usually called support vectors. The kernel function $K$ can have
different forms. For example, $K = (X \cdot Y_i)^d$ implements a polynomial SVM classifier;
$K = \tanh(X \cdot Y_i + \delta)$ implements a two-layer neural network. The simplest form of
the kernel functions is $K = X \cdot Y_i = \sum_{j=1}^{n} x_j \cdot y_{ij}$, namely an $n$-dimensional linear
function.

Regardless of different types of kernels, SVM classifiers always have the same num- 
ber of variables as that in the input feature space. So, SVM learning algorithms them- 
selves do not derive any small subset of features in the learning phase. 
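
A minimal sketch of how such a classifier is evaluated shows why every feature is
involved. It assumes a decision rule of the usual kernel form, predicting 1 when the
kernel-weighted sum over the support vectors plus b is positive; the function names
and the zero threshold are assumptions of this sketch, and the parameters would come
from an SVM training procedure that is not shown.

```python
def linear_kernel(x, y):
    """K(X, Y) = sum_j x_j * y_j; touches every one of the n features."""
    return sum(xj * yj for xj, yj in zip(x, y))


def svm_decide(x, support_vectors, alphas, b, kernel=linear_kernel):
    """Return 1 if sum_i a_i * K(X, Y_i) + b > 0, else -1."""
    total = sum(a * kernel(x, y) for a, y in zip(alphas, support_vectors)) + b
    return 1 if total > 0 else -1
```

Whatever kernel is passed in, it receives the complete vector x, so all n feature
values must be measured for every new sample.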

Naive Bayes [7,11] is a probabilistic learning algorithm based on Bayes' theorem
and the assumption of class conditional independence. Naive Bayes assumes that the
effect of a feature value on a given class is independent of the values of other features.
It follows that Naive Bayes itself cannot conduct feature selection
or derive a small subset of biomarkers, because the calculation of the probability spans
the whole feature space.
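
The point can be illustrated with a minimal sketch of Naive Bayes scoring. Modelling
each feature with a Gaussian likelihood is an assumption made here for continuous
intensities, and the function names and model layout are hypothetical.

```python
from math import log, pi


def gaussian_nb_log_score(x, prior, means, variances):
    """log P(class) + sum_j log N(x_j; mean_j, var_j) under class-conditional independence."""
    score = log(prior)
    for xj, m, v in zip(x, means, variances):  # one term per feature, none skipped
        score += -0.5 * log(2 * pi * v) - (xj - m) ** 2 / (2 * v)
    return score


def nb_predict(x, model):
    """model maps each class label to (prior, means, variances); pick the best score."""
    return max(model, key=lambda c: gaussian_nb_log_score(x, *model[c]))
```

The loop adds one log-likelihood term for each feature, which is why the score cannot
be computed from a small subset of them.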

Due to their randomization mechanisms, the output of genetic algorithms is not stable:
it can depend on the machine, the time, or even the programming language. So, biomarkers
identified by genetic algorithms may change from run to run simply because of different
random seeds. By contrast, tree-based algorithms are stable.



5 Accuracy on the Ovarian Cancer Data Set 

This section reports the 10-fold cross-validation results on the ovarian cancer data set
using SVM, NB, k-NN, decision trees, and CS4. In each iteration, the number of errors
made on the test data by a learning algorithm is recorded; the overall performance is then
the sum of the error numbers over the 10 iterations. This evaluation method is much more
reliable than the method used by Petricoin et al. [15], which relies only on the performance
on one small set of pre-reserved test samples.
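
The evaluation protocol can be sketched as follows. The interleaved, non-stratified
fold assignment is an assumption chosen for brevity, and `train` and `predict` are
hypothetical callables standing in for any of the learning algorithms.

```python
def cross_validation_errors(rows, labels, train, predict, folds=10):
    """Sum the test errors over all folds.

    train(rows, labels) returns a model; predict(model, row) returns a label.
    Fold i holds out samples i, i + folds, i + 2 * folds, ... (assumed split).
    """
    errors = 0
    for i in range(folds):
        test_idx = set(range(i, len(rows), folds))
        tr = [j for j in range(len(rows)) if j not in test_idx]
        model = train([rows[j] for j in tr], [labels[j] for j in tr])
        errors += sum(predict(model, rows[j]) != labels[j] for j in test_idx)
    return errors
```

Every sample is tested exactly once, so the returned total can be divided by the
number of samples to obtain an overall error rate.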

The results are shown in Table 2. The Weka software package
(http://www.cs.waikato.ac.nz/ml/weka/) and our in-house developed CS4
software were used to obtain the results.

Table 2. Errors in the 10-fold cross-validations of 7 learning algorithms on the ovarian cancer data
set. The symbol (x : y) stands for x errors made in the ‘Cancer’ class and y
errors made in the ‘Normal’ class.

Algorithm            Errors       Accuracy
CS4 (20 trees)       0            100%
C4.5                 10 (4:6)     96.0%
Bagging (20 trees)   7 (3:4)      97.2%
Boosting             10 (4:6)     96.0%
SVM                  0            100%
3-NN                 15 (5:10)    94.1%
NB                   19 (17:2)    92.5%




From this table, we can see that both SVM and our CS4 algorithm achieve the
same perfect 100% 10-fold cross-validation accuracy on this ovarian cancer data set.
But they use different numbers of features. SVM always uses all 15154 features
in the classification, while CS4 uses fewer than 100 features. So, for the early detection
of ovarian cancer by mass spectrometry, our CS4 algorithm is better than SVM for
the data analysis, including the discovery of biomarkers from the blood samples. More
results about biomarkers will be shown in the next section. In the 7-fold and
5-fold cross-validations, CS4 made one mistake, while SVM made one mistake in
the 7-fold validation and none in the 5-fold. So, their performances
agree closely with each other.
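
The accuracy figures in Table 2 follow from the error counts and the 253 samples. The
short check below recomputes them; the variable names are illustrative only.

```python
TOTAL = 253  # 162 'Cancer' + 91 'Normal' samples

# 10-fold cross-validation errors as reported in Table 2.
errors = {"CS4": 0, "C4.5": 10, "Bagging": 7, "Boosting": 10,
          "SVM": 0, "3-NN": 15, "NB": 19}

# Accuracy in percent, rounded to one decimal place as in the table.
accuracy = {name: round(100 * (TOTAL - e) / TOTAL, 1) for name, e in errors.items()}
```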

Like CS4, Bagging is also a committee method. In addition to using 20 trees in
a committee, we also tried other numbers such as 30, 40, or 50 trees for Bagging.
However, none of these settings improved Bagging’s performance (it still made the
same 7 mistakes).

One possible way for non-linear learning algorithms to find a small subset of
biomarkers from high-dimensional data sets is the pre-selection of top-ranked features based on
some criterion such as entropy [8]. However, we found that the performance of the
non-linear algorithms decreased when only the top 10, 20, 25, 30, 35, or 40 ranked features
were used. This may be due to the strong bias introduced by using only top-ranked features,
and it also suggests that opening the whole feature space for consideration, as done by
decision trees, is a fair approach to finding biomarkers.



6 Biomarkers Identified from the Blood Samples 

Now that CS4 has the perfect 10-fold cross validation performance, we have strong 
reasons to believe that its identified biomarkers should be very useful for any test samples 
beyond the 253 samples. In the following, we present detailed results about biomarkers 
identified from the whole ovarian cancer data set. 

We generated 20 cascading trees by CS4. Each of the trees contains 2 to 5 features.
In total, there are 72 features in this tree committee. These 72 features are the protein
biomarkers that we can use for the early detection of ovarian cancer based on women’s
blood samples. Mathematically, these biomarkers form 92 simple rules. We list some
of them in the following function:

$$
p(x_1, x_2, \ldots, x_7) =
\begin{cases}
1 & \text{if } x_1 \le a,\ x_2 \le b \\
-1 & \text{if } x_1 \le a,\ x_2 > b,\ x_4 \le d \\
1 & \text{if } x_1 \le a,\ x_2 > b,\ x_4 > d \\
-1 & \text{if } x_1 > a,\ x_3 \le c,\ x_5 \le e \\
1 & \text{if } x_1 > a,\ x_3 \le c,\ x_5 > e \\
-1 & \text{if } x_1 > a,\ x_3 > c \\
1 & \text{if } x_6 \le f,\ x_7 \le g \\
-1 & \text{if } x_6 \le f,\ x_7 > g \\
-1 & \text{if } x_6 > f
\end{cases}
$$

where $x_1, \ldots, x_7$ are biomarkers, $a = 0.435461$, $b = 0.611786$, $c = 0.277777$,
$d = 0.696293$, $e = 0.294503$, $f = 0.251969$, $g = 0.493934$; the two values of this function,
$-1$ and $1$, represent Normal and Cancer respectively. So, given a new blood sample, at
most 3 variables’ intensities are needed to determine $p(x_1, x_2, \ldots, x_7)$. If the function
value is $-1$, then the patient is diagnosed as Normal; otherwise the patient is diagnosed as
Cancer.



7 Discussion and Conclusion 

CS4 differs fundamentally from state-of-the-art committee methods such as
Bagging [1] and AdaBoosting [9]. Unlike them, our method always uses the original training
data instead of bootstrapped, or pseudo, training data to construct a sequence of different
decision trees. Though a bootstrapped training set is the same size as the original data,
some original samples may no longer appear in the new set while others may appear
more than once. So, rules produced by the Bagging or Boosting methods may not be
correct when applied to the original data, whereas rules produced by CS4 reflect precisely
the nature of the original training data. The Bagging or Boosting rules should therefore
be employed very cautiously, especially in biomedical applications where such
concerns could be critical. In addition to being different from Bagging and Boosting,
CS4 also differs from other voting methods such as random forest [2] and
randomized decision trees [6], which randomly select a best feature as root node from a set of
candidates.
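
The property of bootstrapped training sets mentioned above is easy to observe in a
minimal sketch. The integers below are stand-ins for the 253 blood samples, and the
fixed seed is an assumption used only to make the sketch repeatable.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, as Bagging does for each tree."""
    return rng.choices(samples, k=len(samples))


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
original = list(range(253))     # stand-ins for the 253 blood samples
resampled = bootstrap_sample(original, rng)
missing = set(original) - set(resampled)   # samples that were never drawn
```

The resampled set has the same size as the original, yet `missing` is non-empty in
practice, so a tree built on it has not seen every original sample.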

Finally, we summarize the paper. We have applied a new tree-based committee
method to a large ovarian cancer data set for the early detection of this disease and
for finding a small subset of biomarkers from blood samples. The new method, CS4,
achieved perfect 100% 10-fold cross-validation accuracy on this data set. This
method also identified 70 or so biomarkers from the 15154 candidates. Though SVM also
achieved the same perfect accuracy, it could not directly derive a small set of biomarkers
for the early detection of this disease. Using this ovarian cancer data set, we have
also demonstrated how to bridge the gap between contemporary cancer research and
data mining algorithms. We also emphasize that patterns, rules, and mining algorithms
should be easily acceptable to the biomedical fields. As future work, the CS4 algorithm
will be extended to data analysis for other specific diseases such as breast cancer, liver
cancer, and prostate cancer to identify disease-specific biomarkers.

Abstract. Clustering analysis has been applied in a wide variety of fields. In
recent years, it has even become a valuable and useful technique for in-silico
analysis of microarray or gene expression data. Although a number of clustering
methods have been proposed, they are confronted with difficulties in meeting the
requirements of automation, high quality, and high efficiency at the same time.
In this paper, we explore the issue of integration between clustering methods
and validation techniques. We propose a novel, parameter-less, and efficient
clustering algorithm, namely CST, which is suitable for analysis of gene
expression data. Through experimental evaluation, CST is shown to outperform
other clustering methods substantially in terms of clustering quality, efficiency,
and automation under various types of datasets.



1 Introduction 

In recent years, clustering analysis has become a valuable and useful technique for
in-silico analysis of microarray or gene expression data. The main goal of clustering
analysis is to partition a given set of objects into homogeneous groups based on given
features [1]. Although a number of clustering methods have been proposed [1], [3],
[4], [8], [11], they are confronted with some difficulties. First, most clustering
algorithms require users to specify some parameters. In real applications, however, it is
hard for biologists to determine suitable parameters manually. Thus an automated
clustering method is required. Second, most clustering algorithms aim to produce
clustering results based on the input parameters and their own criteria. Hence, they
are incapable of producing an optimal clustering result. Third, the existing clustering
algorithms may not perform well when an optimal or near-optimal clustering result is
required under universal criteria.

On the other hand, a variety of clustering validation measures are applied to
evaluate the validity of the clustering results, the suitability of parameters, and the
reliability of clustering algorithms. A number of validation measures have been proposed,
such as the DB-index [2], simple matching coefficient [6], Jaccard coefficient [6],
Hubert’s Γ statistic [6], FOM [10], ANOVA [7], VCV [5], etc. Nevertheless, their
role is confined to the phase of “post-validation”, with the exception of
Smart-CAST [9]. The study of how validation techniques can help clustering methods
perform well has been strangely neglected.

In this paper, we focus on the integration between clustering methods and validation
techniques. We propose a novel, parameter-less, and efficient clustering algorithm
that is fit for analysis of gene expression data. The proposed algorithm determines
the best grouping of genes on-the-fly by using Hubert’s Γ statistic as the validation
measure during clustering, without any user-input parameter. The experiments on
synthetic microarray datasets showed that the proposed method can automatically
produce “nearly optimal” clustering results at very high speed.

The rest of the paper is organized as follows: In Section 2, we describe the pro- 
posed method, namely Correlation Search Technique (CST) algorithm. In Section 3, 
the empirical evaluation results are presented. Finally, the conclusions are drawn in 
Section 4. 



2 Correlation Search Technique 

For the proposed Correlation Search Technique (CST) algorithm, the original dataset 
must be transformed into a similarity matrix S. The matrix S stores the degree of 
similarity between every pair of genes in the dataset, with the range of similarity 
degree in [0, 1]. The similarity can be obtained by various measurements like Euclid- 
ean distance, Pearson’s correlation coefficient, etc. CST will automatically cluster the 
genes according to the matrix S without any user input parameters. 
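
As a rough illustration of this preprocessing step, the sketch below builds a similarity matrix from expression profiles using Pearson’s correlation coefficient. The rescaling (r + 1)/2 onto [0, 1] and the names `pearson` and `similarity_matrix` are assumptions made for this example; the text does not specify how the similarity degree is normalized.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def similarity_matrix(profiles):
    """Symmetric matrix S with entries in [0, 1]; one row of `profiles` per gene."""
    n = len(profiles)
    S = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: map correlation r in [-1, 1] linearly onto [0, 1].
            s = (pearson(profiles[i], profiles[j]) + 1.0) / 2.0
            S[i][j] = S[j][i] = min(1.0, max(0.0, s))
    return S

# Three toy genes: the second follows the first, the third runs opposite to it.
S = similarity_matrix([[1, 2, 3], [2, 4, 6], [3, 2, 1]])
```

Any other measure, such as a distance converted into a similarity, could be substituted as long as the entries stay in [0, 1] and the matrix is symmetric.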



2.1 Basic Principles 



The CST method integrates a clustering method with a validation technique so that it can
cluster the genes quickly and automatically. By embedding the validation technique,
CST can produce a “near-optimal” clustering result. The validation measure we use
here is Hubert’s Γ statistic [6]. Let X = [X(i, j)] and Y = [Y(i, j)] be two n × n matrices on
the same n genes, where X(i, j) indicates the similarity of genes i and j, and Y(i, j) is
defined as follows:

$$ Y(i,j) = \begin{cases} 1 & \text{if genes } i \text{ and } j \text{ are clustered in the same cluster,} \\ 0 & \text{otherwise.} \end{cases} \qquad (1) $$



Hubert’s Γ statistic represents the point serial correlation between the matrices
X and Y, and is defined as follows if the two matrices are symmetric:



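$$ \Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( \frac{X(i,j) - \bar{X}}{\sigma_X} \right) \left( \frac{Y(i,j) - \bar{Y}}{\sigma_Y} \right) \qquad (2) $$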

where M = n(n - 1)/2 is the number of entries in the double sum, and $\sigma_X$ and $\sigma_Y$
denote the sample standard deviations. $\bar{X}$ and $\bar{Y}$ denote the sample means of the
entries of matrices X and Y. The value of Γ lies in [-1, 1], and a higher value of Γ
represents better clustering quality.
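
To make the definition concrete, the following sketch evaluates Hubert’s Γ statistic directly from a similarity matrix and a cluster labelling. It assumes standard deviations computed with divisor M, which keeps Γ within [-1, 1]; the function name `hubert_gamma` and the toy matrix are illustrative only.

```python
import math

def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma between similarity matrix X and a labelling."""
    n = len(X)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    M = len(pairs)  # M = n(n-1)/2 entries in the double sum
    xs = [X[i][j] for i, j in pairs]
    ys = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    mean_x, mean_y = sum(xs) / M, sum(ys) / M
    # Assumption: divisor M (not M-1) for the standard deviations.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / M)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / M)
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("Gamma is undefined when X or Y is constant")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / M
    return cov / (sd_x * sd_y)

# Toy data: two tight groups, {0, 1, 2} and {3, 4}.
groups = [0, 0, 0, 1, 1]
X = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
      for j in range(5)] for i in range(5)]
good = hubert_gamma(X, groups)           # labelling that matches the structure
bad = hubert_gamma(X, [0, 1, 0, 1, 0])   # labelling that ignores it
```

On this toy matrix the matching labelling is perfectly correlated with the similarities, while the mismatched labelling scores lower.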

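To reduce the time cost of computing the Γ statistic, we simplify it as follows:

$$ \Gamma = \frac{M \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) Y(i,j) - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j)}{\sqrt{\left[ M \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j)^2 - \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) \right)^2 \right] \left[ M \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j)^2 - \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j) \right)^2 \right]}} \qquad (3) $$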



Here, the square of Y(i, j) can be replaced by Y(i, j) itself, since the value of Y(i, j) is
either zero or one. Furthermore, the similarity X(i, j) of a given dataset is fixed regardless
of how the clustering result varies, so the factor that depends only on X can be dropped.
Accordingly, the measurement Γ′ indicating the quality of a clustering result is



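$$ \Gamma' = \frac{M \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) Y(i,j) - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j)}{\sqrt{M \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j) - \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j) \right)^2}} \qquad (4) $$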
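
The simplification can be checked numerically. The sketch below computes Γ′ from three sums over the upper triangle, together with the clustering-independent factor that depends only on X; dividing Γ′ by that factor should recover Γ. The names `gamma_prime` and `x_only_factor` are assumptions for this illustration.

```python
import math

def upper_pairs(n):
    """Index pairs (i, j) with i < j, i.e. the entries of the double sum."""
    return [(i, j) for i in range(n - 1) for j in range(i + 1, n)]

def gamma_prime(X, labels):
    """Simplified measurement: (M*S_XY - S_X*S_Y) / sqrt(M*S_Y - S_Y^2)."""
    pairs = upper_pairs(len(X))
    M = len(pairs)
    s_x = sum(X[i][j] for i, j in pairs)
    s_y = sum(1 for i, j in pairs if labels[i] == labels[j])
    s_xy = sum(X[i][j] for i, j in pairs if labels[i] == labels[j])
    denom = M * s_y - s_y ** 2
    if denom <= 0:
        raise ValueError("undefined when no pair or every pair is co-clustered")
    return (M * s_xy - s_x * s_y) / math.sqrt(denom)

def x_only_factor(X):
    """sqrt(M * sum(X^2) - (sum X)^2): constant for a given dataset."""
    pairs = upper_pairs(len(X))
    M = len(pairs)
    s_x = sum(X[i][j] for i, j in pairs)
    s_xx = sum(X[i][j] ** 2 for i, j in pairs)
    return math.sqrt(M * s_xx - s_x ** 2)

groups = [0, 0, 0, 1, 1]
X = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
      for j in range(5)] for i in range(5)]
ratio = gamma_prime(X, groups) / x_only_factor(X)  # recovers Gamma itself
```

Because the dropped factor is the same for every clustering of a dataset, ranking candidate clusterings by Γ′ gives the same order as ranking them by Γ.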



2.2 CST Method 

The input of CST is a symmetric similarity matrix X, where X(i, j) ∈ [0, 1]. CST is a
greedy algorithm that constructs clusters one at a time, and the currently constructed
cluster is denoted by C_open. Each cluster is started by a seed and is constructed incrementally
by adding (or removing) elements to (or from) C_open one at a time. The temporary
clustering result in each addition (or removal) of x is computed by the simplified
Hubert’s Γ statistic, i.e., equation (4), and is denoted by Γ_add(x) (or Γ_remove(x)). In addition,
the current maximum of the simplified Hubert’s Γ statistic is denoted by Γ_max. We
say that an element x has high positive correlation if Γ_add(x) ≥ Γ_max, and x has high
negative correlation if Γ_remove(x) ≥ Γ_max. CST alternates between adding high positive
correlation elements to C_open and removing high negative correlation elements from it.
When C_open is stabilized by the addition and removal procedure, this cluster is complete
and the next one is started.

To reduce computing time, we simplify equation (4) further. Within each addition
(or removal) stage, the summation of Y(i, j) is the same no matter which
element is chosen. Consequently, (4) can be abridged when deciding the element to be added
to (or removed from) C_open, and the measurement Γ″ of the effect of each added (or removed)
element is

$$ \Gamma'' = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j)\, Y(i,j) \qquad (5) $$

Input: An n-by-n similarity matrix X

0. Initialization:
     M = n(n - 1)/2
     S_X = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} X(i, j)
     S_Y = 0
     S_XY = 0
     C = ∅                   /* The collection of closed clusters */
     U = {1, 2, ..., n}      /* Elements not yet assigned to any cluster */
     Γ_max = 0

1. while (U ≠ ∅) do
     C_open = ∅
     a(·) = 0

   1.1. SEED: Pick an element u ∈ U with most neighbors
        U = U - {u}                                    /* Remove u from U */
        For all i ∈ U set a(i) = X(u, i)               /* Update the affinity */
        C_open = {u}                                   /* Insert u into C_open */

   1.2. ADD: while MaxValidaty(C_open) ≥ Γ_max do
        Pick an element u ∈ U with maximum a(·)
        U = U - {u}                                    /* Remove u from U */
        S_Y = S_Y + |C_open|
        S_XY = S_XY + a(u)
        For all i ∈ U ∪ C_open set a(i) = a(i) + X(u, i)   /* Update the affinity */
        C_open = C_open ∪ {u}                          /* Insert u into C_open */
        Γ_max = MaxValidaty(C_open)

   1.3. REMOVE: while MaxValidaty(C_open) > Γ_max do
        Pick an element v ∈ C_open with minimum a(·)
        C_open = C_open - {v}                          /* Remove v from C_open */
        S_Y = S_Y - |C_open|
        S_XY = S_XY - a(v)
        For all i ∈ U ∪ C_open set a(i) = a(i) - X(v, i)   /* Update the affinity */
        U = U ∪ {v}                                    /* Insert v into U */
        Γ_max = MaxValidaty(C_open)

   1.4. Repeat steps ADD and REMOVE until no more elements are removed.

   1.5. C = C ∪ {C_open}
   end

2. Done, return the collection of clusters, C.

Fig. 1. The pseudo-code of CST 

The pseudo-code of CST is shown in Fig. 1. The subroutine MaxValidaty(·)
computes the maximal value of the measurement Γ′ when adding (or removing) a certain
element. In the addition stage, it is equal to

$$ \frac{M \left( S_{XY} + \max\{ a(u) \mid u \in U \} \right) - S_X \left( S_Y + |C_{open}| \right)}{\sqrt{M \left( S_Y + |C_{open}| \right) - \left( S_Y + |C_{open}| \right)^2}} $$

For the removal stage, it becomes

$$ \frac{M \left( S_{XY} - \min\{ a(v) \mid v \in C_{open} \} \right) - S_X \left( S_Y - |C_{open}| + 1 \right)}{\sqrt{M \left( S_Y - |C_{open}| + 1 \right) - \left( S_Y - |C_{open}| + 1 \right)^2}} $$
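
The procedure of Fig. 1 can be sketched in Python as follows. This is a minimal reading of the pseudo-code, not the authors’ implementation, and it rests on several assumptions: “most neighbors” is interpreted as the largest total similarity to the remaining unassigned elements, removal requires a strict improvement (by a small tolerance `eps`) so that the loop is guaranteed to terminate, and an undefined Γ′ (zero denominator) is treated as minus infinity.

```python
import math

def cst(X, eps=1e-12):
    """Greedy CST-style clustering of a symmetric similarity matrix X."""
    n = len(X)
    M = n * (n - 1) // 2
    S_X = sum(X[i][j] for i in range(n - 1) for j in range(i + 1, n))
    S_Y, S_XY, gamma_max = 0, 0.0, 0.0
    U = set(range(n))   # elements not yet assigned to any cluster
    clusters = []       # the collection of closed clusters

    def gamma_prime(s_xy, s_y):
        # Equation (4); -inf where the denominator vanishes (assumption).
        denom = M * s_y - s_y * s_y
        if denom <= 0:
            return float("-inf")
        return (M * s_xy - S_X * s_y) / math.sqrt(denom)

    while U:
        # SEED (assumption): largest summed similarity to the rest of U.
        seed = max(U, key=lambda u: sum(X[u][v] for v in U if v != u))
        U.remove(seed)
        a = {i: X[seed][i] for i in U}  # affinity of each element to C_open
        a[seed] = 0.0
        C_open = {seed}
        removed_any = True
        while removed_any:
            removed_any = False
            # ADD: keep adding while the best candidate does not lower Gamma'.
            while U:
                u = max(U, key=a.get)
                candidate = gamma_prime(S_XY + a[u], S_Y + len(C_open))
                if candidate < gamma_max:
                    break
                U.remove(u)
                S_Y += len(C_open)       # new co-clustered pairs
                S_XY += a[u]
                for i in U | C_open:
                    a[i] += X[u][i]
                C_open.add(u)
                gamma_max = candidate
            # REMOVE: drop the weakest member while that raises Gamma'.
            while len(C_open) > 1:
                v = min(C_open, key=a.get)
                candidate = gamma_prime(S_XY - a[v], S_Y - len(C_open) + 1)
                if candidate <= gamma_max + eps:
                    break
                C_open.remove(v)
                S_Y -= len(C_open)       # pairs lost by removing v
                S_XY -= a[v]
                for i in U | C_open:
                    a[i] -= X[v][i]
                U.add(v)
                gamma_max = candidate
                removed_any = True
        clusters.append(sorted(C_open))
    return clusters

groups = [0, 0, 0, 1, 1]
X = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
      for j in range(5)] for i in range(5)]
result = cst(X)
```

On this toy matrix the sketch recovers the two planted groups. The running sums S_Y and S_XY are shared across clusters, so each candidate is scored against the whole partial clustering rather than against the open cluster alone.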



3 Experimental Evaluation 

In order to evaluate the performance and accuracy of our CST method, we designed a
correlation-inclined cluster dataset generator. Initially we generate four seed sets of
size three, six, five, and ten, respectively, where the seeds in each set have 15 dimensions
and correlation coefficients less than 0.1. Then, we load these four seed sets into
the generator to generate two synthetic datasets for testing, called Dataset I and
Dataset II. Dataset I contains three main clusters of size 900, 700,
and 500, and 400 additional outliers, while Dataset II contains six main clusters
of size 500, 450, 400, 300, 250, and 200, and 400 additional outliers.

We compare the proposed method CST with the well-known clustering methods
k-means [5], CAST-FI, and Smart-CAST [9]. For k-means, the value of k was
varied from 2 to 21 and from 2 to 41 in increments of 1, respectively. For CAST-FI,
the value of the affinity threshold t was varied from 0.05 to 0.95 in fixed increments of
0.05. The quality of clustering results was measured by Hubert’s Γ statistic, the Simple
matching coefficient [6], and the Jaccard coefficient [6]. We also use intensity images [5]
to exhibit more information produced by the clustering methods.
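
For reference, the two pair-counting measures can be computed as in the sketch below, which compares a known cluster structure with a clustering result over all pairs of objects. The function names and the toy labellings are illustrative assumptions; returning 1.0 for the Jaccard coefficient when no pair is co-clustered in either labelling is a convention chosen here.

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Count object pairs by agreement between two labellings.

    a: same cluster in both; b: same in truth only;
    c: same in pred only;    d: different in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1
        elif same_truth:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d

def simple_matching(truth, pred):
    a, b, c, d = pair_counts(truth, pred)
    return (a + d) / (a + b + c + d)

def jaccard(truth, pred):
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c) if (a + b + c) else 1.0

truth = [0, 0, 0, 1, 1]
pred = [0, 0, 1, 1, 1]
sm = simple_matching(truth, pred)
jc = jaccard(truth, pred)
```

The Jaccard coefficient ignores pairs that are separated in both labellings, which is why it tends to be lower than the Simple matching coefficient when most pairs belong to different clusters.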

Table 1 shows the total execution time and the best clustering quality of the tested
methods on Dataset I. The notation “M” indicates the number of main clusters produced.
Here we consider clusters with size smaller than 50 as outliers. We observe
that CST, Smart-CAST and CAST-FI outperform k-means substantially in both
execution time and clustering quality. In particular, our approach performs 396 times
to 1638 times faster than k-means on Dataset I. In addition, the results also show that
the clustering quality generated by CST is very close to that of Smart-CAST and
CAST-FI for the measurements of Hubert’s Γ statistic, the Simple matching coefficient, and
the Jaccard coefficient. This means that the clustering quality of CST is as good as that of
Smart-CAST and CAST-FI even though the computation time of CST is reduced substantially.



Table 1. Experimental results obtained by applying the tested methods to Dataset I

Methods            Time (s)   # Clusters   Γ Statistic   Matching   Jaccard

CST                < 1        65 (M=3)     0.800         0.981      0.926
Smart-CAST         12         85 (M=3)     0.800         0.986      0.944
CAST-FI            54         91 (M=3)     0.799         0.986      0.945
k-means (k=2-21)   396        6 (M=6)      0.456         0.825      0.427
k-means (k=2-41)   1638       6 (M=6)      0.456         0.825      0.427

Fig. 2. Intensity images on Dataset I. (a) The prior cluster structure, (b) Clustering results of 
CST. (c) Clustering results of CAST 



Table 1 also shows that k-means produces 6 main clusters of size 580, 547,
457, 450, 352, and 114 for the best clustering result. This does not match the real
cluster structure. In contrast, CST produces 65 clusters as the best clustering result,
with 3 main clusters of size 912, 717, and 503. This matches the prior cluster structure
very well. The same observation applies to Dataset II. Moreover, CST also generates
a number of small clusters, which are mostly outliers. This means that
CST is superior to k-means in filtering out the outliers from the main clusters.

The intensity images of the prior cluster structure and the clustering results on
Dataset I are shown in Fig. 2. From Fig. 2(a), we can easily observe that there are three main
clusters and quite a number of outliers in Dataset I. It is also observed that Fig. 2(b)
and Fig. 2(c) are very similar to Fig. 2(a), meaning that the clustering results of CST
and Smart-CAST are very similar to the real cluster structure of Dataset I. The above
observations also apply to Dataset II.



4 Conclusions 

In this paper, we focus on the integration between clustering methods and validation
techniques. We propose a novel, parameter-less, and efficient clustering algorithm,
called CST, for analysis of gene expression data. CST clusters the genes via Hubert’s
Γ statistic on the fly and produces a “nearly optimal” clustering result. Performance
evaluations on synthetic gene expression datasets showed that the CST method can
achieve higher efficiency and clustering quality than other methods without requiring
the users to set up parameters. Therefore, CST can provide a high degree of automation,
efficiency and clustering quality, which are lacking in other clustering methods
for gene expression mining. In the future, we will use real microarray datasets to
evaluate the validity and efficiency of the CST method.


Abstract. This paper describes a method of clustering lists of genes
mined from a microarray dataset using functional information from
the Gene Ontology. The method uses relationships between terms in
the ontology both to build clusters and to extract meaningful cluster 
descriptions. The approach is general and may be applied to assist 
explanation of other datasets associated with ontologies. 

Keywords: Cluster analysis, bioinformatics, cDNA microarray. 



1 Introduction 

Rapid developments in measurement and collection of diverse biological and clin- 
ical data offer researchers new opportunities for discovering relations between 
patterns of genes. The “classical” statistical techniques used in bioinformatics 
have been challenged by the large number of genes that are analysed simultaneously
and the curse of dimensionality of gene expression measurements (in other
words we are looking typically at tens of thousands of genes and only tens of
patients). Data mining is expected to be able to assist the bio-data analysis (see
[1] for a brief overview).

The broad goals of this work are to improve the understanding of genes 
related to a specific form of childhood cancer. Three forms of data are combined 
at different stages. Patient data include cDNA microarray and clinical data for 
9 patients. Usually between 2 and 10 repeat experiments of the same data (i.e.,
patient) are made. For each patient, there are around 9000 genes with between
2 and 10 log ratios (i.e., experiment repeats) for each gene. Clinical data describe
a patient in detail, as well as the effect of different treatment protocols. Of the
nine patients, 4 are labelled as high risk.

The task is to assist in understanding gene patterns in such biodata. The proposed
methodology is shown in Fig. 1. It includes 3 stages. Stage 1 (“DM1:
extract”) is a data mining cycle, which reduces the vast number of genes coming
from the microarray experiments to dozens of genes. Techniques used are described
in detail in [2]. The output of this stage is interesting from a statistical
point of view; however, it is difficult to interpret biologically. Stage 2 (“DM2:
explain”) aims at assisting the interpretation of these outputs. The list of genes
is reclustered over a gene ontology [3] into groups of genes with similar biological
functionality. Descriptions of the clusters are automatically determined for
biological interpretation. Stage 3 (“DM3: generate hypotheses”) aims to summarise
what is known about the genes and to group them in the context of the
microarray measurements. Biologists then can formulate potentially promising
hypotheses and may return to Stage 1.

Fig. 1. Diagram showing methodology used to analyse microarray data



2 DM2: Assisting Biological Explanation 

The focus of this paper is on Stage 2. The cluster analysis and visualisation 
described in this paper takes as input (i) a list of genes highlighted from “DM1: 
extract” and (ii) data from the Gene Ontology. Clustering data according to an 
ontology is a new procedure described in [4]. It entails using a special distance 
measure that considers the relative positions of terms in the ontological hier- 
archy. The particular clustering algorithm is not as important as the distance 
measure. Details of the algorithm are presented in [4]. Recent work in [5] takes
a similar approach.

Table 1. The first few rows of the dataset for the second step in the methodology.

Gene       GO terms directly associated with gene

AA040427   GO:0004715 GO:0005524 GO:0004674 GO:0006468 GO:0008283 GO:0000074
           GO:0005634 GO:0016740
AA046690   GO:0003777 GO:0005524 GO:0007018 GO:0005871
AA055946   GO:0004894 GO:0005057 GO:0004888 GO:0007166 GO:0006968
           GO:0005887

Table 2. Discovered clusters. AAnnnn are GenBank accession codes.

Cluster   Gene    Genes
Number    Count

0         6       AA040427 AA406485 AA434408 AA487466 AA609609
                  AA609759
1         2       AA046690 AA644679
2         6       AA055946 AA398011 AA458965 AA487426 AA490846
                  AA504272
3         9       AA112660 AA397823 AA443547 AA447618 AA455300
                  AA478436 AA608514 AA669758 AA683085
4         20      AA126911 AA133577 AA400973 AA464034 AA464743
                  AA486531 AA488346 AA488626 AA497029 AA629641
                  AA629719 AA629808 AA664241 AA664284 AA668301
                  AA669359 AA683050 AA700005 AA700688 AA775874

We use the Gene Ontology [3], a large collaborative public
set of controlled vocabularies, in our clustering experiments. Gene products are 
described in terms of their effect and known place in the cell. Terms in the ontology
are interrelated: e.g., a “glucose metabolism” is a “hexose metabolism”.
Gene Ontology terms are associated with each gene in the list by searching in
the SOURCE database [6]. The list of genes is clustered into groups with similar
functionality using a distance measure that explicitly considers the relationship
between terms in the ontology. Finally, descriptions of each cluster are found by
examining Gene Ontology terms that are representative of the cluster.

Taking the list of genes associated with high risk patients identified in Stage 1
(examples of such genes are shown in the first column of Table 1), we reclustered
them using terms in the Gene Ontology (the GO:nnnnnnn labels in the
right column of Table 1) into groups of similarly described genes.

3 Results of DM2 

Five clusters are found as shown in Table 2. Half of the genes have been allocated 
to one cluster. The rest of the genes have been split into four smaller clusters 
with one cluster containing only two genes. 

Associated GO terms automatically determine functional descriptions of clusters.
Starting with all the GO terms directly associated with genes in a particular
cluster, we climb the ontology, replacing GO terms with their parents. Terms are
replaced only if the parent node is not associated with genes in another cluster.
Cluster descriptions derived in this way are shown in Table 3. Only the is-a
relationships were followed to build this table. There are far fewer part-of relationships
in the hierarchies, so we do not believe that omitting them affects
the results.

702 P.J. Kennedy et al.

Table 3. Principal cluster descriptions for the genes. Last column is the number of
genes in the cluster associated with the term.

GO ID        GO Term                                                      Number
                                                                          of Genes

Cluster 0 (6 genes)
             20 GO terms but each associated with only one gene           1

Cluster 1 (2 genes)
GO:0008092   cytoskeletal protein binding activity                        2
GO:0007028   cytoplasm organization and biogenesis                        2
GO:0003774   motor activity                                               2
GO:0005875   microtubule associated complex                               2
             5 GO terms but each associated with only one gene            1

Cluster 2 (6 genes)
GO:0004871   signal transducer activity                                   4
GO:0007154   cell communication                                           4
GO:0005887   integral to plasma membrane                                  3
GO:0005886   plasma membrane                                              3
GO:0005194   cell adhesion molecule activity                              2
             11 GO terms but each associated with only one gene           1

Cluster 3 (9 genes)
GO:0030528   transcription regulator activity                             4
GO:0008134   transcription factor binding activity                        3
GO:0006366   transcription from Pol II promoter                           3
GO:0003700   transcription factor activity                                3
GO:0006357   regulation of transcription from Pol II promoter             3
             5 GO terms but each associated with only two genes each      2
             13 GO terms but each associated with only one gene           1

Cluster 4 (20 genes)
GO:0003723   RNA binding activity                                         10
GO:0030529   ribonucleoprotein complex                                    9
GO:0009059   macromolecule biosynthesis                                   9
GO:0006412   protein biosynthesis                                         9
GO:0005829   cytosol                                                      9
GO:0003735   structural constituent of ribosome                           8
             2 GO terms but each associated with only four genes each     4
             5 GO terms but each associated with only three genes each    3
             1 GO term associated with only two genes                     2
             33 GO terms but each associated with only one gene           1

The terms listed in the table are associated only with genes in each
cluster and not in any other cluster. Cluster 0 in Table 3 has no terms that are
associated with more than one gene. This suggests that the genes in the cluster
are either unrelated or related only in ways that are sufficiently high level that
the terms exist in other clusters, and hence that the quality of the cluster
is not good. Cluster 1 contains at least two genes that are related to the cell
cytoskeleton and to microtubules (i.e., components of the cytoskeleton). Cluster
2 contains three or four genes associated with signal transduction and cell signalling.
Cluster 3 contains three or four genes related to transcription of genes,
and cluster 4 contains genes associated with RNA binding.
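
The description step of this section, climbing is-a links until a parent is shared with another cluster, might be sketched as follows. This is one possible reading of the procedure rather than the authors’ code: it assumes a parent counts as “associated with genes in another cluster” when it is an ancestor (or a direct term) of any term in that cluster, and it assumes the is-a hierarchy has no cycles. Apart from the glucose/hexose pair mentioned earlier, the toy hierarchy and all names are hypothetical.

```python
def ancestors_or_self(term, parents):
    """All terms reachable from `term` via is-a links, including itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def describe_clusters(cluster_terms, parents):
    """Map each cluster to the most general terms that remain unique to it."""
    closure = {c: set().union(*(ancestors_or_self(t, parents) for t in terms))
               for c, terms in cluster_terms.items()}
    descriptions = {}
    for c, terms in cluster_terms.items():
        elsewhere = set()
        for other, closed in closure.items():
            if other != c:
                elsewhere |= closed
        # Keep only terms that no other cluster reaches.
        current = {t for t in terms if t not in elsewhere}
        changed = True
        while changed:
            changed = False
            for t in list(current):
                free = [p for p in parents.get(t, ()) if p not in elsewhere]
                if free:  # climb: replace the term with its unshared parents
                    current.discard(t)
                    current.update(free)
                    changed = True
        descriptions[c] = current
    return descriptions

# Hypothetical toy is-a hierarchy (child -> list of parents).
parents = {
    "glucose metabolism": ["hexose metabolism"],
    "hexose metabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": ["metabolism"],
}
cluster_terms = {0: {"glucose metabolism"}, 1: {"protein biosynthesis"}}
descriptions = describe_clusters(cluster_terms, parents)
```

In the toy example both clusters stop climbing just below “metabolism”, the first term they have in common, which mirrors the observation that very high-level terms cannot distinguish clusters.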

4 Conclusions 

We present a methodology for extracting and explaining biological knowledge 
from microarray data. Applying terms from the Gene Ontology brings an un- 
derstanding of the genes and their interrelationships. Currently biologists search 
through such lists gene-by-gene analysing each one individually and trying to 
piece together the many strands of information. Automating the process, at least 
to some extent, allows biologists to concentrate more on the important relation- 
ships rather than the minutiae of searching. Consequently they are enabled to 
formulate hypotheses to test in future experiments. The approach is general and
may be applied to assist explanation of other datasets associated with ontologies.


Abstract. Previous work introduced a 3D particle visualization framework that
viewed each data point as a particle affected by gravitational forces. We showed
the use of this tool for visualizing cluster results and anomaly detection. This
paper generalizes the particle visualization framework and demonstrates further
applications such as determining the number of clusters and identifying clustering
algorithm biases. We don’t claim visualization itself is sufficient in answering
these questions. The methods here are best used when combined with other 
visual and analytic techniques. We have made our visualization software that 
produces standard VRML available to allow its use for these and other 
applications. 



1 Introduction 

Clustering is one of the most popular functions performed in data mining. 
Applications range from segmenting instances/observations for target marketing, 
outlier detection, data cleaning and as a general purpose exploratory tool to 
understand the data. Most clustering algorithms essentially are instance density 
estimation and thus the results are best understood and interpreted with the aid of 
visualization. In this paper, we extend our particle based approach to visualizing 
clustering results [2][3] to facilitate the diagnosis and presentation processes in 
routine clustering tasks. 

Our earlier work describes a general particle framework to display a clustering 
solution [1] and illustrates its use for anomaly detection and segmentation [2]. The 
three-dimensional information visualization represents the previously clustered 
observations as particles affected by gravitational forces. We map the cluster centers 
into a three-dimensional cube so that similar clusters are adjacent and dissimilar 
clusters are far apart. We then place the particles (observations) amongst the centers 
according to the gravitational force exerted on the particles by the cluster centers. A 
particle's degree of membership to a cluster provides the magnitude of the 
gravitational force exerted. 



Our software is available at www.cs.albany.edu/~davidson/ParticleViz. We strongly
encourage readers to refer to the 3D visualizations at the above address while reading the paper.
The visualizations can be viewed in any internet browser with a VRML plug-in
(http://www.parallelgraphics.com/products/downloads/).

We focus on further applications of the visualization technique in clustering,
addressing questions that arise frequently in routine clustering tasks: 1) Deciding the
appropriate number of clusters. 2) Understanding and visualizing representational bias
and stability of different clustering algorithms.

The rest of the paper is organized as follows. We first introduce our clustering
visualization methodology and then describe our improvements to achieve better 
visualization quality for comparing algorithms. We then demonstrate how to utilize 
our visualization technique to solve the example applications in clustering mentioned 
above. Finally we define our future research work direction and draw conclusions. 



2 Particle Based Clustering Visualization Algorithm 

The algorithm takes a k×k cluster distance matrix C and a k×N membership matrix P
as the inputs. In matrix C, each member c_ij denotes the distance between the center of
cluster i and the center of cluster j. In the matrix P, each member p_ij denotes the
probability that instance i belongs to cluster j. The cluster distance matrix may contain
the Kullback-Leibler (EM algorithm) or Euclidean distances (K-Means) between the
cluster descriptions. We believe the degree of membership matrix can be generated by
most clustering algorithms; however, it must be scaled so that Σ_j p_ij = 1.
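
The scaling requirement can be met with a simple normalization, sketched below. The sketch assumes the memberships are stored with one row per instance and one column per cluster; the function name `normalize_memberships` is illustrative only.

```python
def normalize_memberships(P):
    """Scale each instance's degrees of membership so that they sum to 1."""
    normalized = []
    for row in P:
        total = sum(row)
        if total <= 0:
            raise ValueError("each instance needs a positive total membership")
        normalized.append([p / total for p in row])
    return normalized

# Two instances, three clusters; the first row is unnormalized scores.
P = normalize_memberships([[2.0, 1.0, 1.0], [0.2, 0.2, 0.6]])
```

Rows that already sum to 1 are left unchanged up to rounding, so the step is safe to apply to the output of any clustering algorithm.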

The algorithm first calculates the positions of the cluster centers in the three
dimensional space given the cluster distance matrix C. A Multi-Dimensional Scaling
(MDS) based simulated annealing method maps the cluster centers from higher
dimensions into a three dimensional space while preserving the cluster distances in
the higher dimensional instance space. After the cluster centers are placed, the
algorithm puts each instance around its closest cluster center according to its degree
of membership at a distance of r_ij = f(p_ij). Function f is called the probability-distance
transformation function, whose form we will derive in the next section. The
exact position of the instance on the sphere shell is finally determined by the
remaining clusters’ gravitational pull on the instance based on the instance’s degree of
membership to them. Precise algorithmic details are provided in earlier work [2].



3 The Probability-Distance Transformation Function 

We need a mapping function between p (degree of membership) and r (distance to 
cluster center) to convey the density of instances correctly in the visualization. Let 
N(r) be the number of instances assigned to a cluster that are within distance r of its
center in the visualization. If p is the degree of membership of an instance to a cluster, 
then the instance density function Z against the degree of membership is defined by: 

$$ Z = \frac{dN(r)}{dp} = \frac{dN(r)}{dr} \cdot \frac{dr}{dp} \qquad (1) $$

The Z function measures the number of instances that will occupy an interval of size
dp, while the D function measures the number of instances that will occupy an
interval of size dr in the visualization:

$$ D = \frac{1}{r^{2}} \frac{dN(r)}{dr} \qquad (2) $$

We wish the two density functions, one measured by degree of membership and the other
measured by distance in the visualization, to be consistent. To achieve this we
equate the two and bound them so they only differ by a positive scaling constant c.
By solving the differential equation, we obtain the probability-distance function f:






( 3 ) 



The constant c is termed the constricting factor which can be used to zoom in and 
out. Equation (3) can be used to determine how far to place an instance from a cluster 
center as a function of its degree of membership to the cluster. 



4 An Ideal Cluster’s Visual Signature 

The probability-distance transformation function allows us to convey the instance 
density in a more accurate and efficient way. By visualizing the clusters we can tell 
directly whether the density distribution of the cluster is consistent with our prior 
knowledge or belief. The desired cluster density distribution from our prior
knowledge is called an ideal density signature. For example, a mixture model that 
assumes independent Gaussian attributes will consist of a very dense cluster center 
with the density decreasing as a function of distance to the cluster center such as in 
Fig. 1. The ideal visual signature will vary depending on the algorithm’s assumptions. 
These signatures can be obtained by visualizing artificial data that completely abides 
by the algorithm’s assumptions. 




Fig. 1. The Ideal Cluster Signature For an Independent Gaussian Attribute 



Further Applications of a Particle Visualization Framework 



707 



5 Example Applications 

In this section, we demonstrate three problems in clustering we intend to address 
using our visualization technique. We begin by using three artificial datasets: (A) is 
generated from three normal distributions, N (-8,1), N (-4,1), N (20,4) with each 
generating mechanism equally likely. (B) is generated by three normal distributions, 
N (-9,1), N (-3,1), N (20,4) with the last mechanism twice as likely as the other two. 
(C) is generated by two equally likely normal distributions, N (-3, 9), N (3, 9). 
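
Datasets of this kind can be regenerated with a few lines of code. The sketch below is only an approximation of the setup: it assumes the second parameter of N(·,·) is a variance, and the sample size of 3000 and the random seed are arbitrary choices, since the text states neither.

```python
import math
import random

# (mixing weight, mean, second parameter) for each generating mechanism.
MIXTURES = {
    "A": [(1 / 3, -8, 1), (1 / 3, -4, 1), (1 / 3, 20, 4)],
    "B": [(0.25, -9, 1), (0.25, -3, 1), (0.5, 20, 4)],  # last twice as likely
    "C": [(0.5, -3, 9), (0.5, 3, 9)],
}

def sample_mixture(components, size, rng):
    """Draw `size` one-dimensional points from a Gaussian mixture."""
    weights = [w for w, _, _ in components]
    data = []
    for _ in range(size):
        _, mean, variance = rng.choices(components, weights=weights, k=1)[0]
        # Assumption: the second parameter is a variance, so take its root.
        data.append(rng.gauss(mean, math.sqrt(variance)))
    return data

rng = random.Random(0)  # arbitrary seed for repeatability
dataset_a = sample_mixture(MIXTURES["A"], 3000, rng)
dataset_b = sample_mixture(MIXTURES["B"], 3000, rng)
share_a = sum(x > 8 for x in dataset_a) / len(dataset_a)  # mass near 20
share_b = sum(x > 8 for x in dataset_b) / len(dataset_b)
```

The share of points near 20 should be about one third for dataset (A) and about one half for dataset (B), reflecting the different mixing weights.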



5.1 Determining the K Value 

Many clustering algorithms require the user to specify a priori the number of clusters,
k, based on information such as experience or empirical results. Though the selection
of k can be made part of the problem by making it a parameter to estimate [1], this is
not common in data mining applications of clustering. Techniques such as the Akaike
Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC) are
only applicable in probabilistically formulated problems and often give contradictory
results. The expected instance density given by the ideal cluster signature plays a key
role in verifying the appropriateness of the clustering solution.

Our visualization technique helps to determine if the current k value is appropriate.
As we assumed our data was Gaussian distributed, good clustering results should
have the signature density associated with this probability distribution, shown in Fig. 1.

Dataset (A) is clustered with the K-means algorithm with various values of k. The
ideal value of k for this data set is 3. We start with k=2, shown in Fig. 2. The two
clusters have quite different densities: the cluster on the left has an almost uniform
density distribution, which indicates the cluster is not well formed. In contrast, the
density of the right cluster is indicative of a Gaussian distribution. This suggests that the
left-hand-side cluster may be further separable, and hence we increase k to 3 as shown
in Fig. 3. We can see that the density of all clusters approximates the Gaussian distribution
(the ideal signature), and that two clusters overlap (bottom left). At this point we
conclude that k=3 is a candidate solution, as we assumed the instances were drawn
from a Gaussian distribution and our visualization gives a consistent cluster signature.
For completeness, we increased k to 4; the results are shown in Fig. 4. Most of the
instances on the right are part of two almost inseparable clusters whose density is not
consistent with the Gaussian distribution signature.




Fig. 2. Visualization of dataset (A) clustering results with K-means algorithm (k=2). 

Fig. 3. Visualization of dataset (A) clustering results with K-means algorithm (k=3)




Fig. 4. Visualization of dataset (A) clustering results with K-means algorithm (k=4) 



5.2 Comparing Clustering Algorithms

In this sub-section we describe how our visualization technique can be used to
compare different clustering algorithms on their representational biases and
algorithmic stabilities. Most clustering algorithms are sensitive to initial
configurations, and different initializations lead to different cluster solutions; this
property is known as algorithmic stability. Also, different clustering algorithms have different
representations of clustering, and the representational biases also contribute to the
clustering solutions found. Though analytical studies that compare very similar
clustering algorithms such as K-means and EM exist, such studies are difficult for
fundamentally different classes of clustering algorithms.

Algorithmic Stability. By randomly restarting the clustering algorithm, clustering 
the resulting clustering solutions, and applying our visualization technique, we can 
visualize the stability of a clustering algorithm. We represent each solution by its 
cluster parameter estimates (for example, the centroid values for K-means). The number 
of instances in a cluster indicates the probability of this particular local minimum (and 
its slight variants) being found, while the size of the cluster suggests the basin of 
attraction associated with the local minimum. 
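
The restart-and-recluster procedure can be sketched as follows. This is only an illustration under stated assumptions: plain (unweighted) Lloyd's K-means stands in for the weighted algorithms used in our experiments, the toy data are synthetic, sorting the centroids to obtain a canonical solution vector is our own simplification, and the function names kmeans, restart_solutions and stability_summary are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=100):
    """Lloyd's K-means from a random start; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[c])  # keep the centroid of an empty cluster
        if new == centroids:
            break
        centroids = new
    return centroids, labels

def restart_solutions(points, k, restarts, seed=0):
    """Run K-means `restarts` times; each run becomes one solution vector."""
    rng = random.Random(seed)
    solutions = []
    for _ in range(restarts):
        centroids, _labels = kmeans(points, k, rng)
        # Assumption: sort centroids so that relabelled clusters give the same vector.
        solutions.append(tuple(x for c in sorted(centroids) for x in c))
    return solutions

def stability_summary(solutions, meta_k=2, seed=1):
    """Cluster the solutions; report (share of restarts, radius) per solution cluster."""
    rng = random.Random(seed)
    centers, labels = kmeans(solutions, meta_k, rng)
    summary = []
    for c in range(meta_k):
        members = [s for s, lab in zip(solutions, labels) if lab == c]
        share = len(members) / len(solutions)
        radius = max((sq_dist(s, centers[c]) ** 0.5 for s in members), default=0.0)
        summary.append((share, radius))
    return summary

# Synthetic 2-D toy data (not dataset (B)): three Gaussian blobs of 30 points each.
data_rng = random.Random(42)
data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (4, 0), (2, 3)] for _ in range(30)]

solutions = restart_solutions(data, k=3, restarts=50)
summary = stability_summary(solutions)
for share, radius in summary:
    print(f"share of restarts: {share:.2f}, radius: {radius:.3f}")
```

In this sketch the share of restarts plays the role of the number of instances in a solution cluster, and the radius is a crude stand-in for the cluster size that we read off the visualization.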

We use dataset (B) to illustrate this particular use of the visualization technique. 
Dataset (B) is clustered using k=3 with three different clustering algorithms: weighted 
K-means, weighted EM, and unweighted EM. Each algorithm makes 1000 random 
restarts, thereby generating 1000 instances. These 1000 instances, which represent the 
different clustering solutions found, are separated into 2 clusters using a weighted 
K-means algorithm. We do not claim that k=2 is optimal, but it serves our purpose of 
determining the algorithmic stabilities. The results for all three algorithms are shown 
in Fig. 5, Fig. 6 and Fig. 7. We find that the stability of the K-means algorithm is less 
than that of the EM algorithms, and that the stability of unweighted EM is less than that 
of the weighted version of the algorithm, which is consistent with the known literature [4]. 







Fig. 5. Visualization of dataset (B) clustering solutions with unweighted EM 









Fig. 6. Visualization of dataset (B) clustering solutions with weighted EM 







Fig. 7. Visualization of dataset (B) clustering solutions with weighted K-means 

Visualizing Representational Bias. We use dataset (C) to illustrate how to visualize 
the effect of representational bias on the clustering solutions found. We cluster 
with both the EM and K-means algorithms and show the visualized results 
in Fig. 8 and Fig. 9. 

The different representational biases of K-means and EM can be inferred from the 
visualizations. We found that for EM, almost all instances exterior to a cluster’s main 
body are attracted to the neighboring cluster. In contrast, only about half of the 
exterior cluster instances for K-means are attracted to the neighboring cluster. The 
other half are too far from the neighboring cluster to show any noticeable attraction. 
This confirms the well-known belief that, for overlapping clusters, K-means finds 
clusters whose centers are further apart, whose standard deviations are smaller, and 
which are less well defined than those found by EM. 
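
The effect of hard versus soft assignment on overlapping clusters can also be checked numerically. The sketch below is an illustration only, not the procedure behind Fig. 8 and Fig. 9: it uses synthetic one-dimensional data drawn from two unit-variance Gaussians with means 0 and 2 (our assumption, not dataset (C)), a plain two-cluster K-means, and a basic EM for a two-component Gaussian mixture started from the K-means centers. The function names kmeans_1d and em_1d are hypothetical.

```python
import math
import random

def kmeans_1d(xs, iters=100):
    """Two-cluster Lloyd's K-means in 1-D; returns (centers, stds) of the hard clusters."""
    centers = [min(xs), max(xs)]  # deterministic start at the two extremes
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        new = [sum(g) / len(g) for g in groups]
        if new == centers:
            break
        centers = new
    stds = [math.sqrt(sum((x - c) ** 2 for x in g) / len(g))
            for g, c in zip(groups, centers)]
    return centers, stds

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_1d(xs, init_means, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns (means, stds)."""
    mu, sigma, weight = list(init_means), [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft responsibility of component 0 for every point.
        resp = []
        for x in xs:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(p0 / ((p0 + p1) or 1e-300))
        # M-step: responsibility-weighted means, standard deviations and weights.
        for j in range(2):
            r = resp if j == 0 else [1.0 - v for v in resp]
            n_j = max(sum(r), 1e-12)
            mu[j] = sum(ri * x for ri, x in zip(r, xs)) / n_j
            var = sum(ri * (x - mu[j]) ** 2 for ri, x in zip(r, xs)) / n_j
            sigma[j] = max(math.sqrt(var), 1e-6)
            weight[j] = n_j / len(xs)
    return mu, sigma

# Synthetic overlapping clusters: true means 0 and 2, true standard deviation 1.
rng = random.Random(7)
xs = ([rng.gauss(0.0, 1.0) for _ in range(1000)]
      + [rng.gauss(2.0, 1.0) for _ in range(1000)])

km_means, km_stds = kmeans_1d(xs)
em_means, em_stds = em_1d(xs, km_means)
km_gap = km_means[1] - km_means[0]
em_gap = em_means[1] - em_means[0]
print("K-means gap between centers:", round(km_gap, 3), "stds:", km_stds)
print("EM gap between means:", round(em_gap, 3), "stds:", em_stds)
```

Because K-means cuts each cluster off at the midpoint between the two centers, its centers are pushed apart beyond the true gap of 2 and its within-cluster standard deviations fall below the true value of 1. EM keeps the overlapping tails through soft assignment, so its estimates would typically be expected to lie closer to the generating parameters, although on a single finite sample this is not guaranteed.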




Fig. 8. Visualization of dataset (C) clustering results with EM algorithm (k=2) 

Fig. 9. Visualization of dataset (C) clustering results with K-means algorithm (k=2) 



6 Future Work and Conclusions 

Visualization techniques provide considerable aid in clustering problems, as the 
problem focuses on instance density estimation. We focused on three visualization 
applications in clustering. 

• The cluster signature indicates the quality of the clustering results and thus can be 
used for clustering diagnosis, such as finding the appropriate number of clusters. 

• Different clustering algorithms have different representational biases and 
different algorithmic stabilities. Although these can sometimes be analyzed 
mathematically, visualization techniques facilitate the process and aid 
detection and presentation. 

We do not claim that visualizations alone can address these problems, but believe 
that in combination with analytic solutions they can provide more insight. Similarly we are 
not suggesting that these are the only problems the technique can address. The 
software is freely available for others to pursue these and other opportunities. 

We can use our approach to generate multiple visualizations of the output of 
clustering algorithms and visually compare them side by side. We intend to 
investigate adapting our framework to visually compare the output of multiple 
algorithms on a single canvas. 